I. BACKGROUND OF THE INVENTION
I. Field of the Invention:

The present invention relates to the general field of biochemical assays and separations, and to
apparatus for their practice, generally classified in U.S. Patent Class 435/6.

II. Description of the Prior Art

Unlike multicellular organisms, bacteria and simple eukaryotic microorganisms have very limited
morphological diversity and typically do not leave a significant fossil record. It therefore was initially very
difficult to develop a classification system that reflects actual genetic relationships. Instead, classic
bacterial taxonomic methods, such as morphology and carbon source utilization, were used to classify
bacteria in a deterministic way. The goal was to develop a hierarchy of tests that ultimately could
reproducibly assign a consistent name to an unknown isolate. When organisms gave very similar results on
the various tests they would ultimately be assigned to the same species regardless of actual genetic
relationship. Thus, organisms were sometimes grouped together that were fundamentally very different.
This situation changed dramatically in the 1970s due to the pioneering work of Carl Woese and
his colleagues. In order to obtain a genotypic classification, methods based on molecular sequence analysis
of ribosomal RNA (rRNA) were developed. The rRNAs offered the advantage of being found in all
organisms, and the equivalent molecules could be readily isolated and purified from essentially any
organism. The large ribosomal RNAs vary in length depending on the organism and therefore have
different names, e.g. 16S rRNA, 18S rRNA, etc., depending on the organism under consideration. To avoid
this difficulty, the terminology small subunit RNA (SSU RNA) and large subunit RNA (LSU RNA) is used
to specify any of the RNAs belonging to each class. Among the rRNAs, 5S rRNA with approximately 120
nucleotides was thought to be too short to be useful and the LSU RNA (23S rRNA in bacteria) would
have been far more difficult to work with. Attention therefore focused on the SSU RNA (16S rRNA in
bacteria). 16S rRNA is a major component of the bacterial small ribosomal subunit. It consists of
approximately 1,550 ribonucleotides in Escherichia coli and has an intricate secondary structure featuring
extensive intrachain base pairing. The detailed three-dimensional folding of 16S rRNA in the Thermus
aquaticus 30S ribosomal subunit has recently been determined by X-ray crystallography. As a major
component of the ribosome, 16S rRNA interacts with 23S rRNA to establish the overall geometry of the
ribosome and is directly involved in the initiation of protein biosynthesis by ribosomes.

When Woese first began using 16S rRNA in his evolutionary studies it was not technically
feasible to sequence the entire RNA. Therefore a characterization approach was developed (Uchida et al.,
1974) in which the 16S rRNA was fragmented by the nuclease ribonuclease T1. This enzyme cleaves the
RNA at guanosine (G) residues and thereby reduced the RNA to a collection of fragments of various
lengths, each with a single terminal G. The non-G portion of the fragment was then sequenced. The list of all
such fragments obtained from a single RNA was referred to as a catalog. Catalogs of ribonuclease T1
fragments from 16S rRNAs isolated from a variety of organisms were compared to one another and cluster
analysis was used to construct a tree of relationship between the various bacteria (Fox et al., 1977). By
1980, enough data of this type had accumulated that it was possible to construct the first trees that seriously
attempted to identify the actual historical relationships between the various types of bacteria (Fox et al.,
1980; Woese, 1987).
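
The cataloging idea described above can be illustrated with a short Python sketch. It is only an illustration and not part of the original programs: the function name `t1_catalog` is hypothetical, cleavage is modeled simply as a cut after every G, and a trailing piece that lacks a terminal G is dropped by assumption.

```python
def t1_catalog(rna: str) -> list[str]:
    """Model ribonuclease T1 digestion as a cut after every G residue.

    Returns the fragments in 5'-to-3' order. Each fragment ends in a single
    terminal G; a 3' remainder without a terminal G is dropped (assumption).
    """
    fragments = []
    current = ""
    for base in rna.upper():
        current += base
        if base == "G":
            fragments.append(current)
            current = ""
    return fragments


# Hypothetical toy sequence, far shorter than a real 16S rRNA.
catalog = t1_catalog("AUCGGCAUGAA")
```

Comparing such catalogs between organisms amounts to comparing the fragment lists, which is the starting point for the cluster analysis mentioned in the text.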

Later, as sequencing technology improved, it became possible to sequence and compare entire 16S
rRNAs.

In an effort to better understand the tree produced by cluster analysis, an alternative means of
examining relationships known as "signature analysis" was developed (Woese et al., 1980). It was
observed that certain of the ribonuclease T1 fragments were only found in a subset of the 16S rRNA
catalogs. Frequently there was more than one such sequence that was uniquely found in the same group of
organisms. Thus, the term "signature" was introduced: a sequence that is
characteristic of (unique to) a group of organisms defines that group and is a "signature" for the group.
These signatures suggested that there was a relationship between the organisms in the group and so the tree
was examined to see if the tree-generating algorithm had in fact found the expected relationship.

This process of checking the reasonableness of trees produced from the cataloging data was
employed on several occasions (Woese et al., 1980; Woese et al., 1984; McGill et al., 1986). In its final
rendition (McGill et al., 1986), the notion of a signature quality index that could be calculated for every
individual RNAse T1 oligonucleotide was introduced as a means of formalizing the extent to which there
was or was not a signature for each branch in the tree.
Today, comparison of 16S rRNA sequences is widely used to establish the genetic relationship
between bacteria. A typical approach is to amplify and sequence 16S rDNA from various prokaryotic
organisms. The resulting sequences are aligned with other 16S rRNA sequences and an appropriate
method, e.g. maximum likelihood, is used to construct a tree that reflects likely historical relationships.
Several public databases exist containing complete and partial small subunit rRNA sequences. For
example, release 8 of the RDP database (Maidak et al., 2000) includes data for the small subunit RNA from
over 16,000 bacteria, eukaryotes, plastids and mitochondria.

As Woese's work became well known it began to be appreciated that rRNA might be useful in
detecting the presence of a target organism in a test sample. Thus, in 1980 Kohne applied for patents (US
patent 4,851,330 granted 7/25/1989 and 5,288,611 granted 2/22/1994), the essence of which is that a
nucleic acid probe that is complementary to the rRNA of a specific target can be used to detect the presence
of that target. This core approach has been widely used in microbial identification, with probes usually
being devised by sequence comparison rather than by Kohne's preferred embodiment, which was subtractive
hybridization. Several commercial products rely on this approach.

The invention described here provides a novel approach for rapidly determining the genetic
affinity of organisms in a test sample. The invention's methodology is far more general than the
specifically targeted tests of the Kohne approach, and faster and more convenient than detailed sequencing
of the rRNAs or their encoding DNA. The method of this invention is currently most readily utilized with
16S rRNA sequence data but can be adapted to other data sets such as rRNA spacers, RNAse P RNA,
genomic DNA or RNA of viruses, etc. One begins by defining a phylogenetic tree
that includes the organism range of interest, e.g. all bacteria. Then a set of characteristic
oligonucleotides, each of which identifies a group in the phylogenetic tree, is determined according to a
newly developed algorithm of the invention. This set of signature oligonucleotides is utilized in a
hybridization experiment, e.g. a DNA microarray, the results of which are then used to quickly identify the
phylogenetic neighborhood of a problematic bacterium or other microorganism. These hybridization
experiments can be miniaturized so that minimally trained personnel can readily conduct them in difficult
environments. The set of signature oligonucleotides can be updated and redesigned as our knowledge of the
true genetic affinity between known organisms improves. In many cases, the hybridization array will be
able to determine the genetic affinity of multiple organisms in a sample in one experiment. If the organism
turns out to be a previously known organism, its identity can be determined to the species level if suitable
signature oligonucleotides are included in the hybridization. Under some circumstances, the signature
sequences can also be used in assays in which detection does not rely on hybridization.

Problem Solved by the Invention: The Kohne patents (below) teach methods to utilize probes to detect
specific predetermined organisms or groups of organisms. Thus, the '611 patent teaches us how to
determine if a particular species of organism is or is not present in a test sample. The '330 patent teaches us
how to detect specific groups of organisms, as well as individual organisms. It is somewhat limited,
however, in that the probes under that invention are obtained by selection, i.e. subtractive hybridization.
Others have subsequently demonstrated the ability to detect specific groups using probes based on
sequence comparisons.



It is implicit in all these prior art references that one knows what one is looking for. Thus, a prior art test
can be specifically designed for detecting Legionella. However, this is not always what is needed, e.g. a
quick response might be necessary to respond to an outbreak of a previously unknown transmissible
microbial disease. Perhaps even more to the point in this day and age, a terrorist could bioengineer a
normally harmless organism to carry a gene that results in production of a deadly toxin. The resulting
organism would have properties not normally associated with the bacterium that carries the toxin gene.
Indeed, the organism itself might be from a previously unknown genus. Similarly, there are instances where
work is done in remote locations such as the Antarctic or on the International Space Station where one has
extremely limited diagnostic capability available. Even in standard medical practice microbial
identification is needlessly cumbersome in that many alternative specialized tests are now used to identify
the presence of the various known pathogens. In all of these cases, the ability to genetically characterize
and hence identify what organisms or viruses are present in a test sample with a single universal test system
would be invaluable. The invention provides this badly needed solution in a very general way.



SUMMARY OF THE INVENTION
Applicants' method is summarized as follows: 

A. Establish or otherwise obtain a nucleic acid sequence database of the equivalent nucleic acid from
a variety of organisms. It is best to quality control the database, selecting sequences that are complete
and lack unknown segments in the region of interest and discarding the rest. Any of a variety of nucleic acid
sequences is potentially useful. At present the substantial amount of sequence information available for
rRNAs, especially the SSU rRNA (i.e. 16S rRNA in bacteria), makes that molecule an excellent choice for
bacteria and eukaryotic microorganisms. In the case of viruses, the most promising source of information is
currently the sequence of the genomic DNA or RNA.



B. Obtain or develop a bifurcating tree that reflects the genetic
relationships between the organisms or viruses whose sequences are included in the nucleic acid sequence
database that is to be used.



C. Choose a smallest sequence length of interest for the characteristic sequences that will be sought.
This length will differ depending on the length of the nucleic acid molecule or region being examined,
the number of sequences in the dataset and various constraints imposed by the experimental systems that
will be used.



D. Test all possible sequences of this length N against the entries in the nucleic acid sequence
database that is being used in conjunction with the tree. A signature quality function such as Qs is
calculated for every possible sequence of length N at each node in the tree. It is preferable and
computationally efficient to only calculate the Qs value for test sequences of length N that occur at least
twice in the database. Those test sequences that never occur are not signature sequences. Test sequences
that occur once are perfect signature sequences of the particular organism or virus from which the nucleic
acid was obtained. The signature quality function can be defined in a variety of ways but should be
constructed so as to determine the extent to which a test sequence of length N is found in all the organisms
in the database belonging to the set of sequences represented by a node in the tree and not found elsewhere.
A particular test sequence is determined to be a perfect signature of the organisms represented by a
particular bifurcation node on the phylogenetic tree if all the nucleic acid sequences represented by that
node contain the sequence and the sequence is not found in any nucleic acid sequence not represented by
that node. A value Qs between zero (no signature value) and one (perfect signature) is obtained for each
test sequence at each node.

E. Retain as signature sequences those test sequences having Qs above some criterion. A given node
may encompass many signature sequences. Likewise, a particular test sequence can be a signature
encompassed by more than one node, though frequently with differing values of Qs. This reflects the child,
parent, grandparent, etc. relationship between bifurcation nodes on a phylogenetic tree.

F. Optionally, repeat steps D and E for sequences of the desired length (e.g. 7mers, then
8mers, etc.).



G. The signature sequences permit the design of hybridization probes for use in an assay. A typical
assay can employ a plurality of such signature probes representing at least 50%, and typically more, of the
nodes in the applicable phylogenetic tree. The resulting hybridization will allow the identification of the
organism's genetic affinity without the necessity of prior knowledge of what it would be. It is contemplated
that this invention can allow the development of a single test system that can be used to identify a wide
variety of organisms.



H. Once available, the signature sequences can be used in other ways. For example, it is preferable to
detect the presence of specific signature sequences in a sample using mass spectrometry. It is also
preferable to use signature sequences to design PCR primers for a variety of applications.
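
Steps C through E can be pictured with the following Python sketch. It is a minimal illustration under stated assumptions, not the applicants' program (which the specification says was written in Perl): the function name `find_signatures`, the dictionary inputs, and the `>=` threshold test are hypothetical, and the quality value is computed with the count-based form of Qs given later in this specification as equation (4).

```python
from collections import defaultdict


def find_signatures(sequences, groups, n, threshold):
    """Return (oligo, node, Qs) triples whose Qs meets the threshold.

    sequences: {organism name: sequence string}          (assumed input shape)
    groups:    {node id: set of organism names under it} (assumed input shape)
    """
    # Step D, part 1: record which organisms contain each oligo of length n.
    matched = defaultdict(set)
    for organism, seq in sequences.items():
        for i in range(len(seq) - n + 1):
            matched[seq[i:i + n]].add(organism)

    # Step D, part 2 and step E: score each oligo at each node, keep the good ones.
    signatures = []
    for oligo, organisms in matched.items():
        for node, members in groups.items():
            n_gm = len(organisms & members)  # matched organisms inside the group
            qs = n_gm * n_gm / (len(members) * len(organisms))
            if qs >= threshold:
                signatures.append((oligo, node, qs))
    return signatures


# Hypothetical toy data: three organisms, two tree nodes.
toy_sequences = {"a": "ACGTAC", "b": "ACGTTT", "c": "GGGTTT"}
toy_groups = {1: {"a", "b"}, 2: {"b", "c"}}
perfect = find_signatures(toy_sequences, toy_groups, n=4, threshold=1.0)
```

Step F would correspond to calling the function again with a larger `n`. The sketch scores every oligo that occurs at all; restricting it to oligos that occur at least twice, as step D prefers, would be a simple filter on `matched`.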

In abstract form the invention may be described as follows:



Selecting which subsequences in a database of nucleic acid such as 16S rRNA are highly
characteristic of particular groupings of bacteria, microorganisms, fungi, etc. on a substantially
phylogenetic tree. The invention is also applicable to viruses comprising viral genomic RNA or DNA. A
catalogue of highly characteristic signature sequences identified by this method is assembled to establish
the genetic identity of an unknown organism. The signature sequences are used to design nucleic acid
hybridization probes that include the characteristic sequence or its complement, or are derived from one or
more characteristic sequences. A plurality of these signature sequences is used in hybridization to
determine the phylogenetic tree position of the organism(s) in a sample. If the target organism is
represented in the original sequence database, the signature sequences can identify it to the species or
possibly subspecies level. Oligonucleotide arrays of many probes are especially preferred. A hybridization
signal can comprise fluorescence, chemiluminescence, or isotopic labeling, etc.; or sequences in a sample
can be detected by direct means, e.g. mass spectrometry. The method's characteristic sequences can also be
used to design specific PCR primers. The method uniquely identifies the phylogenetic affinity of an
unknown organism without requiring prior knowledge of what is present in the sample. Even if the
organism has not been previously encountered, the method still provides useful information about which
phylogenetic tree bifurcation nodes encompass the organism.

DETAILED DESCRIPTION OF INVENTION 

Brief Description of the Several Views of the Drawings: 

Figure 1 shows schematically the bi-directional binary tree structure.

Figure 2 shows schematically the structure of the composite hash of the oligonucleotides.

Figure 3 shows schematically the flow chart of the principal programs.

Figure 4 shows schematically how Subsystem I converts the format of the sequence file.

Figure 5 shows schematically a phylogenetic tree and its corresponding Newick format presentation.

Figure 6 shows schematically how the tree file in Newick format is parsed in a stepwise and bottom-up manner.

Figure 7 shows schematically how the trimming is stepwise and topology-conserving.

Figure 8 shows schematically how the composite hash of the oligonucleotides is built from the 16S rRNA
sequences.

Figure 9 shows schematically how the number of oligonucleotides and their respective lengths are
related.

Figure 10 shows the representative prokaryotic phylogenetic tree in Newick format.

Figure 11 shows a graphic view of the representative prokaryotic phylogenetic tree.

Figure 12 shows a local region of the representative tree following trimming from 38 to 12 sequences. The
branch numbers in the representative tree are labeled in the picture and can be correlated with the results
given in Table F. The complete representative tree is given in Newick format in Figure 10 and shown in
graphical form on the CD that is part of this application.

Table A illustrates by example certain information that is on the CD that is part of this application. The
table illustrates, for test sequences of length 15, the five best signature quality scores and the nodes they are
associated with in the phylogenetic tree.

Complete lists of this type are on the CD for several different sequence lengths.
Table B illustrates by example certain information that is on the CD that is part of this application. The
table illustrates signature sequences of length 12 that are completely unique to the organism that is
indicated.

Table C shows the subsystems of the programs used and their functions and components.
Table D shows the numbers of possible oligonucleotides of different lengths.

Table E shows the number of signature sequences that were found at various quality levels as a function
of length.

Table F shows the preferred parameters for the invention.
Utility of the Invention: 

The invention can identify the genetic grouping an unknown organism belongs to even if no perfect match
is found for the organism of interest (the "target"). The invention designs a set of probes that allows one to
approximately position any target organism on a tree that displays the genetic relationship between the
various organisms. With the invention, it is not necessary to know what organism or group of organisms
one is looking for, nor is it necessary that it be previously known to science. Ultimately, even if
nothing matches, the invention nonetheless gives useful information. For example, it might be learned that
the unknown organism belongs to the group of enteric bacteria but is not any of the known species. Using
the invention, it is straightforward to generate a clear file with the five best signature quality values, in the
format of Table A. The five best signature quality scores for the indicated sequence are listed with the
specific node in the phylogenetic tree.

Unanticipated problems involving microorganisms occur in a variety of settings including space flight, 
medicine, indoor air quality, bioweapons of mass destruction, epidemics, etc. It would be of value to have a
diagnostic system that could readily identify what microorganism is present regardless of prior expectations 
of what might be found, so as to facilitate a rapid assessment of what is occurring prior to choosing of 
countermeasures. It is especially essential to determine the genetic identity of the organism that is causing 
the problem as closely as possible, since this will clarify where the organism came from, what treatments 
are likely to be effective, etc. 

Fortunately, each 16S rRNA sequence contains sub-sequences that are widely conserved throughout
the dataset and, despite the fact that there are now over 16,000 publicly available sequences, there are still
large numbers of other sub-sequences that are totally unique to, and hence characteristic of, a particular
species or various groups of species that can be identified by methods of the invention. Surprisingly, this
pattern of sequence conservation is so strong that it is possible to design specific oligonucleotide
hybridization probes that can distinguish individual organisms, and groupings of organisms, in a tree of
relationship defined by 16S rRNA. Once appropriate signature sequences have been
identified for a desired assay, appropriate probes can be designed. Although it is anticipated that probes
based on the signature sequences will be used directly, in some applications the probes can be modified
before use. For example, a "wildcard" base such as inosine might be used to extend or even modify the
specificity of a probe. Moreover, two nearby probes might be combined to make a larger probe. Any of a
variety of formats can be used to implement the assays. Thus, the final analysis system may utilize
PCR-amplified nucleic acids or, because rRNAs are typically present in many thousands of copies per cell, just
the sample RNA alone. A variety of detection systems can be used, comprising fluorescence,
chemiluminescence and isotopic detection. The resulting assay is highly compatible with hybridization
array technology (DNA microarrays), which will allow the simultaneous assay of all the nodes in the
underlying tree in one experiment. Thus, it is possible to replace many tests with just one.

It is inherent in the prior art that only predetermined microorganisms or groups of microorganisms will be
detected. This reflects the fact that prior art assays are based on prior identification of specific probes for
the intended application. It is widely believed that a microbial detection system cannot be designed without
prior knowledge of what is to be detected. The invention described here implements a novel approach to
assay design that overcomes this problem.



Scientific Basis of the Invention

Although the invention is not to be limited by any theory or by the way in which the invention was
achieved, the following may be helpful in understanding the invention. An extremely effective approach to
determining genetic relatedness among bacteria is to amplify and sequence their 16S rRNA genes (Fox et
al., 1980; Woese, 1987). The resulting sequences are aligned with other 16S rRNA sequences and an
appropriate method, e.g. maximum likelihood, is used to construct a phylogenetic tree. This process is
reasonably fast, very accurate and facilitated by programs and data available via the Internet at the
Ribosomal Database Project (RDP) web site (Maidak et al., 2000). Many thousands of 16S rRNA
sequences, representing essentially all known genera of bacteria,
are now available in the RDP and other ribosomal RNA databases. Therefore, when a new isolate of
uncertain affiliation is found here on Earth, its genetic identity can be inferred from its placement in the
16S rRNA phylogenetic tree.

It was observed early on in the 16S rRNA literature that there were in fact many characteristic ribonuclease
T1 "signature" oligonucleotides (ribonuclease T1 oligonucleotides being the subset of all possible
oligonucleotides that end in G and contain no internal G). The existence of such signature
oligonucleotides in a set of 16S rRNA sequences actually reflects the fact that certain individual positions
have a particular value (i.e. A, C, G or U) in all organisms belonging to a particular cluster and a different
value for organisms which do not belong to the cluster. The phylogenetic breadth of the cluster
encompassed is different for each signature position and the signatures are typically somewhat noisy in that
the characteristic nucleotide is absent in some organisms that belong to the cluster of interest and present in
some organisms that are outside the cluster. The information that is carried by these very informative sites
is nevertheless precisely what underlies the success of standard algorithms that construct phylogenetic
trees.

In order to quantify this information, a signature quality index, which ranges from 0 (no meaningful
signature) to 1 (perfect signature), was developed for use with the ribonuclease T1 oligonucleotides (McGill
et al., 1986). Such an index allows the quantitative characterization of the utility of any oligonucleotide in
determining if an unknown organism belongs to any particular genetic grouping in a particular tree of
genetic relatedness. In order to implement the invention it was necessary to modify the signature quality
function for use with complete sequence data. The signature quality index used is of the following type:

$$Q_s = f_{\mathrm{in}} \times (1 - f_{\mathrm{out}}) \qquad (1)$$

where $Q_s$ is a measure of signature quality, $f_{\mathrm{in}}$ is the frequency of the signature sequence within the group
under consideration, and $f_{\mathrm{out}}$ is the frequency of the signature sequence outside the group of interest. The
frequencies are based on the number of sequences in the dataset that a particular oligonucleotide matches
and the resulting function again varies from 0 (no meaningful signature) to 1 (perfect signature).

To illustrate this function, consider a particular heptamer that is found in 50 distinct sequences. If 40 of
these occurrences are in a single taxonomic cluster, which contains 50 members, and the remaining 10
occurrences are scattered among the remaining sequences, the resulting value of Qs is 0.64. Finally, the user
of the invention needs to understand that when members of a sequence cluster share an oligonucleotide
which is not found in non-members of the cluster (e.g. when Qs is high) the oligonucleotide in question will
almost always be found to occur in the equivalent place in all the 16S rRNAs that have it. This reflects the
fact that useful signature sequences are phylogenetically conserved at various levels of genetic relationship.
This is not obvious because it initially seems very counterintuitive. It is, however, the reason high quality
signature oligonucleotides exist. If this were not the case the various oligonucleotides would be randomly
scattered throughout the various sequences and high values of Qs would be uncommon and not predictive
of what would be found in sequences that were not yet known.
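
The heptamer example can be checked numerically. The short Python sketch below is illustrative only; the function name `signature_quality` is hypothetical. Note that, following the definitions given later in the specification, the outside-group frequency is taken relative to the total number of matched sequences (10 of 50), not relative to the number of sequences outside the cluster.

```python
def signature_quality(f_in: float, f_out: float) -> float:
    """Signature quality: in-group frequency times (1 - out-of-group frequency)."""
    return f_in * (1.0 - f_out)


# Heptamer example from the text: 50 matches in total, 40 of them inside
# a cluster that has 50 members, 10 of them elsewhere.
f_in = 40 / 50   # fraction of cluster members that contain the heptamer
f_out = 10 / 50  # fraction of all matches that fall outside the cluster
q_s = signature_quality(f_in, f_out)
print(round(q_s, 2))
```

The product is 0.8 x 0.8, which reproduces the value of 0.64 stated above up to floating-point rounding.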

It is also important to realize that there are many alternative ways in which the signature quality function,
Qs, can be defined. One might, for example, take the logarithm of Qs values or use values of 1 - Qs. More to
the point, one could square the first factor in Equation 1 to give more weight to any false negatives or cube the
second factor to strongly penalize false positives.
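
A hedged Python sketch of such variants follows. The function name `weighted_quality` and its two exponent parameters are hypothetical conveniences for illustration; the specification mentions only the squared and cubed cases.

```python
def weighted_quality(f_in: float, f_out: float,
                     in_power: int = 1, out_power: int = 1) -> float:
    """Variant of the signature quality function with adjustable exponents.

    in_power > 1 penalizes false negatives (group members lacking the oligo);
    out_power > 1 penalizes false positives (matches outside the group).
    With both exponents equal to 1 this reduces to Equation 1.
    """
    return (f_in ** in_power) * ((1.0 - f_out) ** out_power)


baseline = weighted_quality(0.8, 0.2)                      # Equation 1
strict_negatives = weighted_quality(0.8, 0.2, in_power=2)  # squared first factor
strict_positives = weighted_quality(0.8, 0.2, out_power=3) # cubed second factor
```

Because both factors lie between 0 and 1, raising either exponent can only lower the score, so every variant still ranges from 0 to 1 and a perfect signature still scores 1.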

What size of oligonucleotides will give useful signature information? For short
sequences, the equivalence of position is overshadowed by chance matches: of the 4,096 (4^6)
different hexamers, many can be expected to occur by random chance among the roughly 1,500 hexamers
that one expects to find in a single 16S rRNA sequence. Thus, the heptamers (4^7 = 16,384 in total)
represent the smallest sequence length that is likely to give useful signature information. On the
opposite side, large oligonucleotides tend to be unique to individual organisms. That is to say, as
oligonucleotide size increases, a larger portion of the signatures will be for leaf nodes, e.g. small numbers
of closely related organisms, and a decreasing percentage will signify internal nodes. Based on prior
experience with 16S rRNA ribonuclease T1 oligonucleotides, it is likely that sequences larger than length
15 will mainly have utility for leaf nodes.
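
The size argument can be made concrete with a small Python sketch. It is an illustration only: the function names are hypothetical, and the chance estimate assumes a random sequence with equal base frequencies, a simplification that the specification does not itself make.

```python
def possible_oligos(n: int) -> int:
    """Number of distinct oligonucleotides of length n over a 4-letter alphabet."""
    return 4 ** n


def expected_chance_hits(seq_len: int, n: int) -> float:
    """Expected occurrences of one given n-mer in a random sequence.

    Assumes independent, equally likely bases (illustrative assumption).
    """
    windows = seq_len - n + 1  # number of overlapping n-mers in the sequence
    return windows / possible_oligos(n)


# Compare several lengths for a sequence of about 1,500 nucleotides.
for n in (6, 7, 12, 15):
    print(n, possible_oligos(n), expected_chance_hits(1500, n))
```

Under this simplification the expected number of chance occurrences of a given oligonucleotide drops by roughly a factor of four with each added nucleotide, which is consistent with hexamers being too short and long oligonucleotides being nearly organism-specific.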



Design and implementation
Programming language

Except for the first program, readseq, which is preinstalled as a binary executable, the other programs
developed for this project were written in Perl.

Perl is a freely available, nonproprietary, open-source programming language. Thus, programs written in
Perl will not be affected by possible future changes in the license of the language compiler/interpreter. Perl
is also a very high-level general-purpose language; it has 4 function points per 100 lines of code,
compared with 0.8 for C and 2 for C++. This means that software development in Perl is generally much
faster than that in most other programming languages. Perl is especially efficient in dealing with text,
which makes it an appropriate choice for manipulating genetic sequences. In addition, Perl's excellent
built-in data structures, automatic garbage collection, and almost unrivalled portability also make it more
attractive.

More information on Perl and its newest release can be found at the Perl web site: http://www.perl.com.

Data structures



All Perl built-in data structures, namely scalar, array, and hash, are used in this invention. Because of the
complexity of the data representations, more sophisticated data structures, such as a bi-directional binary tree
and a composite hash, are also used.

Given the characteristic structure of the phylogenetic tree, it was natural to represent it as a binary tree in
the program. In this case the tree structure is special in that it is bi-directional. The parent tree node has a
pointer to each of its two child tree nodes, and the child tree node also has a pointer back to its parent tree
node (Figure 1). This unusual tree structure is required to facilitate the signature quality index value
calculation.




Each leaf tree node has five data fields: "shortName", "fullName", "leafNumber", "isValid", and
"isMatched" (Figure 1). The first two fields hold the abbreviated name and the full name of the prokaryote.
leafNumber records the sequentially assigned number of the leaf node in the tree. The last two are Boolean
variables used mainly for calculation purposes. Each branch tree node has four data fields: "nodeNumber",
"numLeaves", "numValidLeaves", and "numMatchedLeaves" (Figure 1). The first field records the
sequentially assigned number of the branch tree node. The other fields record the number of leaves, "valid"
leaves, and "matched" leaves descended from this branch tree node, respectively.

Figure 1 shows the bi-directional binary tree structure with three leaf nodes. Note that a parent node has
two pointers to its child nodes and each child node has a pointer back to its parent.
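
The tree described above can be sketched as follows. This is a Python illustration, not the Perl implementation the specification refers to: the field names are snake_case renderings of the fields listed in the text, while the `mark_matched` helper and the organism names are hypothetical and show one plausible way the parent pointers could be used to keep the per-branch counts up to date.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Leaf:
    short_name: str
    full_name: str
    leaf_number: int
    is_valid: bool = True
    is_matched: bool = False
    # Pointer back to the parent; excluded from repr/eq to avoid recursion.
    parent: Optional["Branch"] = field(default=None, repr=False, compare=False)


@dataclass
class Branch:
    node_number: int
    left: Union["Branch", Leaf]
    right: Union["Branch", Leaf]
    num_leaves: int = 0
    num_valid_leaves: int = 0
    num_matched_leaves: int = 0
    parent: Optional["Branch"] = field(default=None, repr=False, compare=False)

    def __post_init__(self):
        # Wire the child-to-parent pointers and accumulate leaf counts bottom-up.
        for child in (self.left, self.right):
            child.parent = self
            if isinstance(child, Leaf):
                self.num_leaves += 1
                self.num_valid_leaves += int(child.is_valid)
                self.num_matched_leaves += int(child.is_matched)
            else:
                self.num_leaves += child.num_leaves
                self.num_valid_leaves += child.num_valid_leaves
                self.num_matched_leaves += child.num_matched_leaves


def mark_matched(leaf: Leaf) -> None:
    """Flag a leaf as matched and update every ancestor via the parent pointers."""
    if leaf.is_matched:
        return
    leaf.is_matched = True
    node = leaf.parent
    while node is not None:
        node.num_matched_leaves += 1
        node = node.parent


# Three-leaf tree in the spirit of Figure 1 (names are placeholders).
a = Leaf("orgA", "Organism A", 1)
b = Leaf("orgB", "Organism B", 2)
c = Leaf("orgC", "Organism C", 3)
inner = Branch(2, a, b)
root = Branch(1, inner, c)
mark_matched(a)
```

With the upward pointers, marking one leaf touches only its ancestors, so the matched-leaf count at every branch is available without re-traversing the whole tree for each oligonucleotide.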

A composite hash was used to store all the oligonucleotides of a specific length derived from a dataset of 
the prokaryotic 16S rRNA sequences and their related information. The "infrastructure" of this composite 
hash was implemented withPerl's.builtrin hash. Because of the complexity of the information on each 
oligonucleotide, an anonymous hash data structure was heavily used to accomplish the task 

In Perl, a hash is composed of unique keys and their associated values. The keys of the outermost
layer of the composite hash are the sequences of the oligonucleotides and the value of each key is an
anonymous hash which has three keys: "matchingTimes", "matchedOrg", and "treeNodeValues". The
value of "matchingTimes" counts how many times the oligonucleotide occurs in the 16S rRNA sequence
dataset. The value of "matchedOrg" is the set of the names of the organisms whose 16S rRNA sequences
are matched by this oligonucleotide. Because of the special nature of the hash - that is, its keys must be
unique - the set is also implemented with an anonymous hash, whose keys are the names of the matched
organisms and whose corresponding values are set to "undef". The value of "treeNodeValues" records the five
highest quality index values of the oligonucleotide in an anonymous hash whose keys
are the branch tree node numbers and whose corresponding values are the quality index values (Figure 2).
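For illustration, the three-layer structure can be mirrored with nested Python dictionaries. This is only a sketch: the oligonucleotide sequences, organism names, node numbers, and index values below are invented placeholders, `None` stands in for Perl's "undef", and the helper `best_node` is hypothetical.

```python
# Outer keys: oligonucleotide sequences. Inner keys follow the names in the text.
composite_hash = {
    "ACGUACGUA": {
        "matchingTimes": 3,                          # total occurrences in the dataset
        "matchedOrg": {"orgA": None, "orgB": None},  # dict used as a set of names
        "treeNodeValues": {12: 1.0, 15: 0.5},        # branch node number -> quality index
    },
    "GGAUCCGAU": {
        "matchingTimes": 1,
        "matchedOrg": {"orgC": None},
        "treeNodeValues": {},                        # not yet calculated
    },
}


def best_node(entry):
    """Return the (node number, quality index) pair with the highest index, or None."""
    values = entry["treeNodeValues"]
    if not values:
        return None
    return max(values.items(), key=lambda item: item[1])
```

Using a dictionary with empty values as a set reflects the point made in the text: key uniqueness gives set semantics for free.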

Figure 2 shows the elaborate structure of the composite hash used in the program. Only two entries are
shown in this figure. A hash is represented by a table and the keys are shaded. The symbol ∅ denotes the data
type "undef" in Perl. The data in this hash are for elucidatory purposes only.

Algorithm: 

The signature quality index measures how well an oligonucleotide (probe) signifies a taxonomic group of
prokaryotic organisms in the phylogenetic tree. Thus, the index quantitatively measures the "quality" of the
signature sequences and ranges from 0 (no meaningful signature) to 1 (perfect signature). The index can be
mathematically expressed as:

$$Q_s = ({}^{i}f_s) \times (1 - {}^{o}f_s) \qquad (1)$$

where $Q_s$ is a measure of signature quality, ${}^{i}f_s$ is the frequency of the signature sequence within the
group under consideration, and ${}^{o}f_s$ is the frequency of the signature sequence outside the group of interest.
Given a defined group of prokaryotes, ${}^{i}f_s$ and ${}^{o}f_s$ can be empirically described as:

$${}^{i}f_s = N_{GM} / N_{GT} \qquad (2)$$
$${}^{o}f_s = (N_M - N_{GM}) / N_M \qquad (3)$$

where $N_M$ is the number of probe-matched prokaryotes in the entire tree, $N_{GM}$ is the number of
probe-matched prokaryotes in the group of interest, and $N_{GT}$ is the number of prokaryotes in the group under
consideration. Substituting equations (2) and (3) into equation (1), we have:

$$Q_s = (N_{GM} / N_{GT}) \times (1 - (N_M - N_{GM}) / N_M) = N_{GM}^2 / (N_{GT} \times N_M) \qquad (4)$$
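As a quick numerical check of the derivation, the sketch below evaluates both the two-factor form of equation (1) and the closed form of equation (4). The function and parameter names are chosen for this example only, and the treatment of a probe that matches nothing (returning 0) is an assumption rather than something stated in the text.

```python
def signature_quality(n_gm, n_gt, n_m):
    """Closed form (4): N_GM^2 / (N_GT * N_M)."""
    if n_gt <= 0:
        raise ValueError("the group must contain at least one organism")
    if n_m == 0:
        return 0.0  # assumption: a probe that matches nothing signifies nothing
    return (n_gm * n_gm) / (n_gt * n_m)


def signature_quality_two_factor(n_gm, n_gt, n_m):
    """Equation (1) with the frequencies of equations (2) and (3) substituted."""
    freq_inside = n_gm / n_gt
    freq_outside = (n_m - n_gm) / n_m
    return freq_inside * (1 - freq_outside)


# A probe matching all 3 members of a group and nothing else is a perfect signature.
perfect = signature_quality(3, 3, 3)
# A probe matching 2 of 4 group members plus 2 organisms outside the group.
partial = signature_quality(2, 4, 4)
```

Both forms give 0.25 for the second case: half the group is matched, and half of all matches fall inside the group.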



Preferably, the invention uses equation (4) to calculate the signature quality index Qs, and in order to do so
during run time it keeps track of N_GM, N_GT, and N_M of every oligonucleotide of a specific length at every
internal tree node. Since equation (4) is derived from equations (1), (2), and (3), if any one of these three
equations changes, which may occur based on new insight into how characteristic signatures occur and are
distributed in 16S rRNA sequences, equation (4) will change accordingly. This flexibility allows for
system improvements, which are included in the invention.

System implementation 

The identification system used to find characteristic oligonucleotides in the 16S rRNA sequence dataset
consists of the following twelve principal programs and several auxiliary programs, all provided on the CD
enclosed with the application.
Principal programs:

■ readseq (preinstalled program, not written by the author)
■ fasta2flat
■ seq_classifier
■ tree_parser
■ select_seq
■ probes_hash_table_generator
■ calc_node_value
■ result_reporter & result_reporter_
■ group_node_lister
■ list_hit_branch_nodes
■ hybridize

Auxiliary programs:

■ nodeselector
■ tree2newick

Figure 3 gives a panoramic view of the relationship among the principal programs and the dataflow in this
system. This oligonucleotide identification system can be roughly divided into four functionally different
subsystems, which in turn carry out sequence file format conversion, internal data structure preparation,
function value calculation, and result presentation respectively (Table A).

The unaligned prokaryotic 16S rRNA sequences were downloaded from the RDP in GenBank format. The
16S rRNA sequences are from those prokaryotic organisms that appear in the comprehensive prokaryotic
phylogenetic tree. GenBank format is the standard format for annotated nucleic acid and protein sequences.
In this format, a sequence is recorded with several fields of information including its locus, definition,
reference, and origin. Since only the abbreviated names of the organisms and the 16S rRNA sequences in
the sequence file are needed for the purpose of this project and all other information is redundant, it is
necessary to extract the needed data from the sequence file and discard the rest in order to increase
program efficiency.

This data extraction functionality is fulfilled by subsystem I, the sequence file format conversion
subsystem, which is composed of readseq and fasta2flat (Figure 4). Readseq is a preinstalled program. It is
a convenient and useful utility to convert the format of a sequence file among GenBank, FASTA, and many
other formats. FASTA format is also a common sequence format and is usually used in sequence alignment.
In this format, a right angle bracket (">") introduces the sequence annotation on the same line, which is
followed by the sequence itself starting on a new line. This project used readseq to change the 16S rRNA
sequence file from GenBank format to FASTA format. In this step only the names of the organisms and the
16S rRNA sequences are retained while all other information is discarded.

Since the 16S rRNA sequence is long and extends over several lines in FASTA format, it is not convenient to
use the sequences in this format. To further facilitate the manipulation of the 16S rRNA sequences and the
corresponding organism names, the program fasta2flat takes the sequence file in FASTA format as the
input and rewrites the sequence data in a "flat" format, in which every line is a data entry starting with the
organism name, followed by a tab character ("\t") as the separator, followed by a string of letters (A, U, G,
C), which is the 16S rRNA sequence.
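The conversion performed by fasta2flat can be sketched as below. This is a hedged Python illustration, not the original program: the function name `fasta_to_flat` is hypothetical, and taking the first whitespace-delimited token of the annotation line as the organism name is an assumption about the input.

```python
def fasta_to_flat(fasta_lines):
    """Collapse multi-line FASTA records into 'name<TAB>sequence' entries."""
    entries = []
    name, chunks = None, []
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                entries.append(name + "\t" + "".join(chunks))
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""   # assumed: first token is the name
            chunks = []
        elif name is not None:
            chunks.append(line)                  # sequence continues over several lines
    if name is not None:
        entries.append(name + "\t" + "".join(chunks))
    return entries


example = [
    ">orgA placeholder annotation",
    "AUGGCU",
    "AACGGA",
    ">orgB",
    "GGCAUU",
]
flat = fasta_to_flat(example)
```

Each resulting entry can then be split on the tab character to recover the name and the full-length sequence in one step.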

As shown in Figure 4, Subsystem I converts the format of the sequence file.

Subsystem II builds the binary prokaryotic phylogenetic tree and the composite oligonucleotide hash.

These internal data structures were used to calculate the function value at each branch tree node.

Release 7 from RDP contains a total of 7,322 prokaryotic 16S rRNA sequences. However, not all of these
sequences can be used to generate the set of oligonucleotides (please refer to the section on program
probes_hash_table_generator for an explanation of how the set of oligonucleotides was generated), because
many of them are only partial sequences of 16S rRNAs (e.g. a sequence has only 300 nt instead of about
1,500 nt, the full length of 16S rRNA) and many contain positions in the sequences that have not been fully
determined (i.e. a position is denoted by a letter other than A, U, G, and C). Program select_seq filtered
out these problematic "invalid" sequences and retained 1,921 "valid" sequences that are fully determined
and longer than 1,400 nt.
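The two validity conditions can be expressed as a short predicate. This Python sketch is illustrative only; the function name is hypothetical, and treating a sequence of exactly 1,400 nt as valid is an assumption (the text elsewhere rejects sequences "less than 1,400 nucleotides in length").

```python
MIN_LENGTH = 1400            # length cutoff stated in the text
ALLOWED = frozenset("AUGC")  # any other letter marks an undetermined position


def is_valid_sequence(sequence, min_length=MIN_LENGTH):
    """True when the sequence is long enough and fully determined."""
    return len(sequence) >= min_length and set(sequence) <= ALLOWED


full_length = "AUGC" * 350                   # 1,400 nt, fully determined
with_unknown = "AUGC" * 349 + "AUGN"         # 1,400 nt, one undetermined position
partial = "AUGC" * 75                        # 300 nt
valid_flags = [is_valid_sequence(s) for s in (full_length, with_unknown, partial)]
```

Passing a smaller `min_length` is one way to experiment with the relaxed criteria discussed later in the text.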

The comprehensive prokaryotic phylogenetic tree based upon 16S rRNA sequences in Newick format was
obtained from the RDP web site. The Newick format for representing trees in computer-readable form
makes use of the correspondence between trees and nested parentheses, noticed in 1857 by the famous
English mathematician Arthur Cayley. A simple exemplary tree and its corresponding Newick format are
depicted in Figure 5.

As shown in Figure 5, the invention can form a phylogenetic tree and its corresponding Newick format
presentation.

The tree in Newick format ends with a semicolon. Interior (branch) nodes are represented by a pair of
matched parentheses. Between them are representations of the nodes that are immediately descended from
that node, separated by commas. The tree in Figure 5 has six leaf nodes at the tips (A, B, C, D, E, and F)
and five branch nodes inside (the root node and the branch nodes 1 - 4). A branch node can be at any place
where a leaf node is located, which results in parentheses nested to any level. The comprehensive
prokaryotic phylogenetic tree has 7,322 leaf nodes and 7,321 branch nodes. Since the tree is far from being
balanced (as the evolution of life itself is not balanced), some branches of the tree go very deep.


The Newick format of the tree file obtained from the RDP website largely conforms to the Newick
Standard described above with minor differences, such as the usage of commas and single quotes. See Figure
10 for an example. The tree file contains taxonomic group identifiers and branch lengths. Much information
is also recorded for every leaf node, including the abbreviated organism name and the full name.
When the program tree_parser parses the tree file and builds the internal tree structure, only the abbreviated
and full names of the organism are kept for each leaf node and all other information is discarded. The
abbreviated name is later compared with every name in the set of matched organisms of every
oligonucleotide to determine if the leaf node is matched by a particular oligonucleotide. The full name is
used purely for illustrative purposes whenever clear identification of an organism is necessary. Since this
system does not use taxonomic group identifiers and evolutionary distances, these data in the tree file were
also ignored.

Due to the algorithms and methods used to construct the phylogenetic tree, almost all phylogenetic trees are
bifurcating, that is, a branch node has exactly two child nodes: a left node and a right node. This feature of
a phylogenetic tree makes a binary tree a natural and excellent choice of data structure to represent it in a
program. In some cases, the distinction between the relative branching orders is very close and three or
more branches are shown as emerging at the same node. Such nearly bifurcating trees are not a problem for
the method as they are readily reduced to a bifurcating tree. The tree file in Newick format is parsed in a
stepwise and bottom-up manner. Program tree_parser scans the tree file and adds one leaf node at a time to the
nascent internal tree, facilitated by a stack of references. Figure 6 shows how a simple internal binary tree is
built step by step (the reference stack is not shown).

Figure 6 shows how the tree file in Newick format is parsed in a stepwise and bottom-up manner. (a) A
phylogenetic tree in Newick format. (b) The internal tree structure is built stepwise and from the bottom up.
The filled circles denote leaf nodes and the hollow circles branch nodes.
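A stack-based, bottom-up parse of this kind can be sketched as follows. The Python code is an illustration under stated assumptions, not a reconstruction of tree_parser: it accepts only bare Newick strings made of names, commas, parentheses, and a closing semicolon (no branch lengths, quotes, or group identifiers), it numbers branch nodes in the order in which their closing parentheses are met, and the dictionary-based node layout is invented for the example.

```python
def parse_newick(text):
    """Parse a bare Newick string such as '((A,B),C);' into linked dict nodes."""
    stack = [[]]        # each list collects the children of a still-open '('
    name = ""
    branch_count = 0

    def flush_leaf():
        # Turn the characters collected so far into a leaf under the open branch.
        nonlocal name
        if name:
            stack[-1].append({"name": name, "children": [], "parent": None})
            name = ""

    for ch in text:
        if ch.isspace():
            continue
        if ch == "(":
            stack.append([])
        elif ch == ",":
            flush_leaf()
        elif ch == ")":
            flush_leaf()
            children = stack.pop()
            if not stack:
                raise ValueError("unbalanced parentheses")
            branch_count += 1
            node = {"number": branch_count, "children": children, "parent": None}
            for child in children:
                child["parent"] = node      # back pointer from child to parent
            stack[-1].append(node)
        elif ch == ";":
            break
        else:
            name += ch
    flush_leaf()
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("malformed Newick string")
    return stack[0][0]


tree = parse_newick("((A,B),C);")
```

Because a branch node is created only when its closing parenthesis is reached, every child already exists at that moment, which is what makes the construction bottom-up.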

Program tree_parser builds the internal comprehensive prokaryotic phylogenetic tree using the tree file in
Newick format as the input and serializes it to an external binary file, SSU_Prok.tree.bin, for possible
later use. It then marks the leaf nodes in the internal tree structure "valid" or "invalid" according to the
names of prokaryotes in file SSU_Prok.fasta.converted.valid, the output of program seq_classifier, and
serializes the marked tree to file SSU_Prok.treeMarkedTotal.bin. This tree structure can be used later to
calculate the function values, but the process is inefficient because nearly 74% of the leaf node sequences
are not of the very highest quality. The tree is large and the existence of invalid leaf nodes makes its size
unjustifiable. Another difficulty is that some taxonomically different branch nodes may actually represent
the same group of valid descendant leaf nodes.

These potential difficulties were avoided by using a representative tree based on only the highest quality
sequences. Building such a representative tree requires a comprehensive analysis of the existing published
tree of 7,322 sequences to determine which groupings and individual sequences, e.g. known pathogens,
need to be included. This representative tree met these three qualifications:

■ It only contains bacteria whose 16S rRNAs have been fully sequenced.

■ At least one organism represents each major taxonomic grouping.

■ The topology of this representative tree should conform to that of the comprehensive tree.

In order to construct a representative tree, 929 bacteria were selected from the 1,921 prokaryotes whose 16S
rRNA sequences are of the highest quality. The list of the leaf node numbers of these 929 prokaryotes was kept
in the text file selected_leaf_node_list. The resulting representative tree is far more comprehensive than the
98-sequence version provided by RDP with its Release 7 dataset.

In order to keep the topology of the representative tree consistent with that of the comprehensive tree,
after writing out the binary files SSU_Prok.tree.bin and SSU_Prok.treeMarkedTotal.bin, program
tree_parser used the list of selected leaf nodes in file selected_leaf_node_list as the reference to "trim
away" (Figure 7) invalid and valid-but-unselected leaf nodes in the tree structure, resulting in a
representative tree with 929 valid leaf nodes. This trimmed tree structure was serialized to the binary file
SSU_Prok.treeMarkedTrimmed.bin, which was later used in the signature quality index value calculations.
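Topology-conserving trimming can be illustrated with a small recursive sketch. It is written in Python with a deliberately simplified representation (a leaf is a name string, a branch is a two-element tuple) and it works recursively rather than stepwise, so it should be read as an assumption-laden illustration of the idea rather than the procedure used by tree_parser.

```python
def trim(node, keep):
    """Drop every leaf not in `keep`, collapsing branches left with one child."""
    if isinstance(node, str):                 # leaf: keep it or trim it away
        return node if node in keep else None
    left = trim(node[0], keep)
    right = trim(node[1], keep)
    if left is None:
        return right                          # the surviving side replaces the branch
    if right is None:
        return left
    return (left, right)                      # both sides survive: branch is kept


comprehensive = (("A", "B"), ("C", ("D", "E")))
representative = trim(comprehensive, {"A", "D", "E"})
```

In the example the result is `("A", ("D", "E"))`: D and E remain each other's closest relatives and A remains their outgroup, exactly as in the untrimmed tree.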

Figure 7 illustrates that the trimming is stepwise and topology conserving.

Program select_seq takes three files, SSU_Prok.fasta.converted.valid, selected_leaf_node_list, and
SSU_Prok.tree.bin, as the input and generates file SSU_Prok.fasta.converted.valid.selected as the output,
which will be used to construct the composite oligonucleotide hash in the next step. Input file
SSU_Prok.fasta.converted.valid is the output of program seq_classifier. It contains all "valid" 16S rRNA
sequences in a special "flat" format. File selected_leaf_node_list keeps all leaf node numbers of the
selected prokaryotes. SSU_Prok.tree.bin is the binary file from which the comprehensive prokaryotic
phylogenetic tree is retrieved. The tree structure is used to index between the leaf node number and the
abbreviated organism name in the corresponding leaf node. The output file holds the 16S rRNA sequences
of the selected organisms in the same format as SSU_Prok.fasta.converted.valid.



Program probes_hash_table_generator is responsible for generating the composite hash, which records the
needed information for each of the oligonucleotides of a specific length occurring in the 16S rRNA
sequence dataset. The program takes the probe length (x) as the command line argument and implicitly
opens sequence file SSU_Prok.fasta.converted.valid.selected to get the abbreviated names of selected
organisms and their corresponding 16S rRNA sequences. The hash for probes of length x is output as
binary file hashForProbeLengthx.bin.

Since only the oligonucleotides occurring in the 16S rRNA sequences are considered interesting, naturally
all oligonucleotides and their initial cognate information used in this system are derived directly from the
16S rRNA sequences. If we consider the number of all possible oligonucleotides of a specific length, the
computational saving by deriving oligonucleotides directly from 16S rRNA sequences is substantial. Out of
all 1,048,576 (4^10) possible decamers, approximately 234,000 occur in the dataset of the 1,921 "valid"
16S rRNA sequences and 133,599 of them occur more than once. Only these 133,599 multi-occurring
decamers (12.7% of all possible decamers) are used in the next step to calculate the function values since we
are only interested in identifying the phylogenetic neighborhood/group of an unknown bacterium. By definition,
oligonucleotides that are unique cannot be characteristic of a group.



Program probes_hash_table_generator reads in the selected 16S rRNA sequences and for each sequence it
excises oligonucleotides of the specified length from the 5' end, shifting one nucleotide at a time, to the 3'
end (Figure 8). Since an oligonucleotide can occur in 16S rRNAs from several organisms and several times
in one particular 16S rRNA, the occurrence count (matchingTimes) of an oligonucleotide in the hash can
only be equal to or greater than the number of the organisms (matchedOrg) whose 16S rRNAs it occurs in.

Figure 8 illustrates how the composite hash of the oligonucleotides is built from the 16S rRNA sequences.
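The sliding-window construction can be sketched in a few lines. The Python below is an illustration, not the original Perl program; the function name and the two toy sequences are invented, and the inner dictionary keys simply reuse the field names given in the text.

```python
def build_composite_hash(sequences, length):
    """Slide a window of `length` along each sequence, one nucleotide at a time."""
    table = {}
    for organism, sequence in sequences.items():
        for start in range(len(sequence) - length + 1):
            oligo = sequence[start:start + length]
            entry = table.setdefault(
                oligo,
                {"matchingTimes": 0, "matchedOrg": {}, "treeNodeValues": {}},
            )
            entry["matchingTimes"] += 1           # counts every occurrence
            entry["matchedOrg"][organism] = None  # counts each organism once
    return table


toy_sequences = {"orgA": "AUGAUG", "orgB": "UGAUC"}
toy_hash = build_composite_hash(toy_sequences, 3)
multi_occurring = sorted(k for k, v in toy_hash.items() if v["matchingTimes"] > 1)
```

In the toy data "AUG" occurs twice but only in orgA, which shows why matchingTimes can exceed the number of matched organisms.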






At this point the system has completed the necessary preparative work, namely the sequence file format
conversions and the data structure constructions. With those steps complete, the system is now ready to
calculate the function value at each branch tree node. Subsystem III, the function value calculation
subsystem, consists of only one program - calc_node_value. It takes the probe length (x) as the command
line argument and implicitly reads in the corresponding binary probe hash file hashForProbeLengthx.bin
and the binary tree file SSU_Prok.treeMarkedTrimmed.bin.



For each multi-occurring oligonucleotide from the hash reconstructed from the binary hash file, leaf nodes
in the phylogenetic tree are marked if this sequence occurs in the 16S rRNAs of the organisms at these leaf
nodes. At each branch node the number of its descendant marked leaf nodes is counted by using the
unusual backward pointers in the tree structure. The signature quality index values are calculated at all the
branch nodes and then sorted in descending order. The top five highest values and their corresponding
branch node numbers are kept as the value/key pairs in the treeNodeValues anonymous hash field of this
probe in the composite hash. After the calculation is completed the result is output as a binary file,
hashForProbeLengthxCalc.bin, which is essentially the same as hashForProbeLengthx.bin except that
the treeNodeValues field for each multi-occurring oligonucleotide is populated with the calculation results.
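The per-probe calculation can be sketched as follows. This Python illustration makes several assumptions that are not in the text: the `Node` class and function name are invented, every leaf of the (already trimmed) tree is treated as valid so that N_GT is simply the number of leaves under a branch, and only branch nodes with at least one matched descendant are scored, since all others have a quality index of 0.

```python
class Node:
    """Minimal tree node with child pointers and a back pointer to the parent."""

    def __init__(self, number=None, name=None, children=()):
        self.number, self.name = number, name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self
        # A leaf counts as one leaf; a branch sums its children.
        self.num_leaves = sum(c.num_leaves for c in self.children) or 1


def top_node_values(leaves, matched_orgs, keep=5):
    """Return {branch number: Qs} for the `keep` best branch nodes of one probe."""
    matched = [leaf for leaf in leaves if leaf.name in matched_orgs]
    n_m = len(matched)                       # N_M: matched organisms in the whole tree
    counts = {}                              # branch node -> N_GM
    for leaf in matched:
        node = leaf.parent
        while node is not None:              # climb to the root via back pointers
            counts[node] = counts.get(node, 0) + 1
            node = node.parent
    scores = {
        node.number: (n_gm * n_gm) / (node.num_leaves * n_m)   # equation (4)
        for node, n_gm in counts.items()
    }
    best = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:keep]
    return dict(best)


leaf_a, leaf_b, leaf_c = Node(name="orgA"), Node(name="orgB"), Node(name="orgC")
branch_1 = Node(number=1, children=[leaf_a, leaf_b])
branch_2 = Node(number=2, children=[branch_1, leaf_c])
values = top_node_values([leaf_a, leaf_b, leaf_c], {"orgA", "orgB"})
```

Climbing from each matched leaf touches only the ancestors of matched leaves, so the cost per probe depends on the number of matches and the depth of the tree rather than on the size of the whole tree.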


Subsystem IV, the result presentation subsystem, reconstructs the composite probe hash and retrieves the
calculation results from file hashForProbeLengthxCalc.bin. It is the open end of the system: the calculation
result can be analyzed and presented in a variety of ways because any program, as long as it can reconstruct
the composite hash from the binary file, can "plug into" the system via subsystem IV and interpret the
calculation results in its own way. Currently this subsystem consists of five programs (Table C).

Programs result_reporter and result_reporter_, as their names suggest, are a pair of similar result-presenting
programs. They both take the length of probe (x) as the command line argument, reconstruct the composite
hash filled with the calculation results from the corresponding hashForProbeLengthxCalc.bin, and give a list of
signature sequences with information on their quality index, their identified branch nodes, and the
descendant leaf nodes as the output files. The only difference between these two programs is that the
former outputs the list of signature sequences sorted in descending order of the node numbers of the
identified branch nodes while the list output by the latter is sorted in descending order of the signature
quality indexes.



Programs group_node_lister and list_hit_branch_nodes present the result from the perspective of the
taxonomic groups. group_node_lister lists all identified branch nodes along with their corresponding
signature sequences of a particular length specified at the command line. list_hit_branch_nodes takes a
more ambitious approach. It gets all the calculation results of oligonucleotides from heptamer to undecamer
from files hashForProbeLengthxCalc.bin (x = 7-11) and collects the number of times that a branch node is
identified by characteristic oligonucleotides of a specific length at signature quality levels 0.6, 0.8, and 1.0
respectively. The output of this program is a set of statistics that indicate the relationships among
the frequency with which a branch node is identified, the oligonucleotide length, and the signature quality.

Program hybridize was used to test the usefulness of the characteristic oligonucleotides that the system has
discovered so far. It takes a sequence file as the input in which every entry starts with a label followed by a
tab character ("\t") as the separator followed by the actual 16S rRNA sequence. Although this program can
use any reasonably good set of characteristic oligonucleotides as the hybridization probes, in this
preliminary test nonameric signatures were used and they gave satisfactory results. When hybridize reads in
a 16S rRNA sequence, it compares ("hybridizes") this sequence against all the characteristic
oligonucleotides with a signature quality better than a specified threshold in the selected probe catalogue.
When a probe is expected to bind to the 16S rRNA, it is recorded by marking the corresponding branch
node in the representative phylogenetic tree. The output of hybridize is one marked representative tree for
each combination of unknown 16S rRNA sequence and signature quality threshold (0.6, 0.8, or 1.0). Some
interesting and noteworthy features of the results will be discussed later.
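The matching step can be sketched as below. This Python example is illustrative only: the catalogue layout (probe sequence mapped to branch node numbers and quality values), the function name, and the toy data are assumptions, a probe is taken to "bind" when it occurs as an exact substring of the sequence, and the threshold is applied inclusively so that the 1.0 level still admits perfect signatures.

```python
def hybridize(sequence, probe_catalogue, threshold=0.6):
    """Return {branch node number: number of probes that marked it}."""
    marked = {}
    for probe, node_values in probe_catalogue.items():
        if probe not in sequence:            # assumed: exact substring match
            continue
        for node_number, quality in node_values.items():
            if quality >= threshold:         # assumed: inclusive threshold
                marked[node_number] = marked.get(node_number, 0) + 1
    return marked


# Invented toy catalogue: probe -> {branch node number: quality index}.
catalogue = {
    "AUGGC": {7: 1.0, 3: 0.7},
    "CCGUA": {3: 0.9},
    "GGGGG": {9: 1.0},
}
unknown = "UUAUGGCAACCGUAUU"
hits_06 = hybridize(unknown, catalogue, threshold=0.6)
hits_08 = hybridize(unknown, catalogue, threshold=0.8)
hits_10 = hybridize(unknown, catalogue, threshold=1.0)
```

Raising the threshold shrinks the set of marked nodes, which matches the trade-off between sensitivity and false positives described in the examples.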

Valid 16S rRNA sequences 

The 7,322 bacterial 16S rRNA sequences obtained from RDP release 7 vary widely in quality. Some
were fully determined in terms of both the length and every position of the sequence while others are
partially sequenced and/or contain one or more undetermined positions. Any sequence that was either less
than 1,400 nucleotides in length or has nucleotides other than AUGC (especially N, standing for a
position where the sequence could not be determined) was considered invalid by the system and was
filtered away. Many of these sequences had very minor difficulties, i.e. marginally shorter than required or
containing up to 3 uncertain sequence assignments, and could have been used without significant effect.
However, since 1,921 16S rRNA sequences met the strongest criteria it was possible to maintain the very
highest standard. Thus only the sequences deemed valid were retained to generate the sets of signature
oligonucleotides.

Although the two conditions disqualifying problematic 16S rRNA sequences greatly simplify how the
system deals with low-quality sequences, they are probably far too strict and as a result the current
calculations likely did not make maximum use of all the sequence information in the dataset. Sequences a
few nucleotides short of 1,400 nt or those that contain a small number of undetermined positions are
currently discarded, even though their signature sequences remain mostly intact. To mitigate this problem,
the quality demands can be moderately relaxed, i.e. by lowering the length requirement and only discarding
the oligonucleotides containing undetermined positions instead of the whole 16S rRNA sequence.
However, if a representative phylogenetic tree is used instead of a comprehensive one (as in this system),
the effect of losing sequence data should be mild since only a subset of 16S rRNA sequences are used
anyway. If a branch of the comprehensive phylogenetic tree is absent from the representative tree due to
lack of valid 16S rRNA sequences in that cluster, either the quality demands can be decreased as described
above or sequences from two very closely related organisms can be fused to ensure that this particular
branch will be included. Also, it should be appreciated that the distinction between the
relative branching orders may be very close in some areas of the tree. When this occurs it is not uncommon
to show three or more branches emerging from the same node. Such nearly bifurcating trees are not a
problem for the method as they are readily reduced to a bifurcating tree.


Oligonucleotides in 16S rRNA sequence dataset 

The number of all possible oligonucleotides of a specific length evidently depends on both the length and
how many different nucleotides are legitimate at each position. Given that there are four different
nucleotides (A, U, G, C in RNA and A, T, G, C in DNA), if the length of the oligonucleotide is n, the
number of all possible length-n oligonucleotides is 4^n. When length n is large, the oligonucleotides
occurring in the 16S rRNA sequence dataset are only a non-random fraction of all possible oligonucleotides
and there is no simple formula to calculate this number. Table D summarizes these numbers for
oligonucleotides under consideration in this system from hexamer to undecamer. Figure 9 plots these data
and gives a direct visual perception of the trends.



Figure 9 shows that the number of oligonucleotides and the length are related. (a) The number of all
possible oligonucleotides increases exponentially with the length. The curve is described by the function
f(n) = 4^n. (b) The numbers of the total and multi-occurring oligonucleotides in the 16S rRNA sequence dataset
also increase with the length. The increases are slower than that in (a) due to the sequence context
constraint from 16S rRNA.
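The three quantities compared in Figure 9 can be computed for any dataset with a short counting routine. The Python sketch below is an illustration on invented toy sequences; the function name and the returned dictionary keys are assumptions, and it does not reproduce the numbers of Table D.

```python
from collections import Counter


def oligo_statistics(sequences, length):
    """Count possible, observed, and multi-occurring oligonucleotides of one length."""
    counts = Counter(
        seq[i:i + length]
        for seq in sequences
        for i in range(len(seq) - length + 1)
    )
    return {
        "possible": 4 ** length,                               # f(n) = 4^n
        "observed": len(counts),                               # distinct oligos seen
        "multi": sum(1 for c in counts.values() if c > 1),     # seen more than once
    }


stats = oligo_statistics(["AUGAUG", "UGAUC"], 3)
```

For real 16S rRNA data the observed count falls further and further behind 4^n as n grows, which is the trend the figure describes.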

Signature oligonucleotides in 16S rRNA sequence dataset 

At a branch node in the phylogenetic tree, if an oligonucleotide gives a quality index value greater than a
preset value, this oligonucleotide is said to be a signature at that branch node since it can identify that node
better than other oligonucleotides which have a lower value of the quality index. In the current system, 0.6
is the cutoff value, i.e. only oligomers with a function value over 0.6 at a branch node will be presented in the
results.

Of course, several signatures may identify a branch node and an oligonucleotide may also be a signature
simultaneously at several branch nodes. Clearly, the higher the quality index value of a signature at a
branch node is, the better it can identify that node. A signature with a function value of 0.8 is better than
one with a function value of 0.6 at the same branch node, and a signature with function value 1.0 is perfect
for that node, which, according to the definition of the signature quality function, means that all 16S
rRNAs having this signature sequence are in the phylogenetic group defined by that branch node, that every
member of that group carries it, and thus that no 16S rRNAs with the same signature are outside that group.

Signatures of different lengths are distributed in the phylogenetic tree differently. The general observation
is that long and short signatures have polar distributions in the tree: the long signatures tend to identify the
branch nodes near the tree leaves while the short ones tend to pick out those near the tree root.
This trend is evident when the results of pentameric and undecameric signatures are compared. The result
shows that 35 out of 35 (100%) perfect (Qs = 1.0) pentameric signatures identify the root while 11,958 out
of 18,746 (64%) perfect undecameric signatures identify the two-leaves-as-two-children branches.

Short signatures, e.g. pentamers and hexamers examined by the system, are generally too unspecific to
identify any interesting small groups in the phylogenetic tree at the Qs levels examined. They tend to identify
the whole bacterial tree instead. However, if a smaller nucleic acid such as 5S rRNA is used then sequences of
this length might be significant. On the other hand, long signatures, e.g. undecameric and longer
oligonucleotides, are increasingly specific and therefore more useful to identify individual organisms and
two-leaves-as-two-children groups. Signatures with a length between seven and eleven should have a more
balanced distribution in the phylogenetic tree.

2,533 nonameric signatures can identify phylogenetic groups with three or more (up to 23) members
perfectly. On the >0.8 and >0.6 quality levels there are 5,580 and 15,340 nonameric signatures respectively. At
this length, the signature sequences cover/identify ~80% of the phylogenetic groups in the representative
tree. The user can refer to Table E for a quick comparison.

In Table E a "gap" between the numbers of signatures shorter than octamers and those longer than
heptamers is evident. On every level of signature quality examined, namely where Qs is equal to 1.0, 0.8,
or 0.6, there is a sharp unexpected increase in the number of signatures and tree coverage from heptamers
to octamers.

Table E provides signature statistics for oligonucleotides of lengths up to undecamers
and also 15-mers. Only signature sequences that can identify phylogenetic groups with three or more
members are counted in constructing this table. A computer program is used to calculate the coverage. Any
branch nodes other than those that have two leaf nodes as their two child nodes in the representative tree
are regarded as phylogenetic groups (615 in total). The signature quality Qs is greater than 0.6.

Illustrative Examples 
Example 1. A Local Region of the Tree & Its Associated Signatures 

The purpose of this example is to better illustrate, at a more detailed level, the relationship between the
signature sequences found and the nodes of the tree used. Table F lists only the results with reference to a
local region of the comprehensive tree. Before trimming, this region contained 16S rRNAs representing 38
organisms. A total of 23 of these sequences were of the very highest quality but many of them were very
similar, so a total of 12 sequences were selected for final inclusion in the representative tree. This local
region of the representative tree is shown in Figure 12. The numbers of nonameric, undecameric and
15-mer signature sequences at each of the 11 branch tree nodes in this 12 organism sub-tree in different ranges
of quality levels are summarized in Table F. Tree node 5547 does not have any signatures at the Qs 1.0
level whereas its parent branch, node 5549, has 14 perfect nonameric/undecameric/15-mer signatures.
Several of these are the same sequences that serve as signatures for node 5547 at the Qs 0.8
level. This result draws attention to the fact that many individual oligonucleotides are signatures of several
branch nodes at differing levels of Qs. This reflects the child/parent relationship between nodes. The
signatures identifying the taxonomic group represented by the local root node 5577 of the representative
tree illustrate another common feature. Of the 17 perfect signatures for node 5577, five are nonameric, six
are undecameric and six are 15-mers. However, every one of these five nonameric signatures appears as a part
of one of the six undecameric signatures. This inclusion of a shorter signature sequence as a part of a longer
one is frequently seen regardless of the signature length, the signature quality level and the position of
interest in the phylogenetic tree.

Example 2. In silico hybridization

Once the characteristic oligonucleotides (signature sequences) from the 16S rRNA sequence dataset are
identified, they can be used to implement in silico hybridization. (This is not carried out in the laboratory.
Instead, it is performed virtually by a computer program, thus "in silico".) This procedure can be either
executed as a standard experimental routine or, as in this case, used as a quick test of the validity of the
signatures that have been identified.

Since these characteristic oligonucleotides were derived from the selected valid 16S rRNA sequences using
the corresponding representative tree, several valid 16S rRNAs that were not selected to make the
representative tree were chosen as 16S rRNAs from "unidentified" bacteria. Program hybridize was used to
perform in silico hybridization between the unknown 16S rRNAs and the characteristic oligonucleotides.
The unknowns were thus placed in their predicted phylogenetic neighborhoods in the representative tree.
Because the comprehensive phylogenetic tree is available, the validity of the predictions could be
quickly and definitively checked.

This in silico hybridization experiment was set up with the following parameters:

Probe length: 9 (nonameric) and 11 (undecameric)
Quality level: 0.6, 0.8, and 1.0
16S rRNAs:
control: Escherichia coli (E.coli)
tests with the following valid sequences:
Methanobacterium formicicum (Mb.formici)
Tetragenococcus halophilus (Tgc.halop2)
Orientia tsutsugamushi (Ort.tsuts6)
test done with the following invalid sequence:
the isolate M2 of the symbiont of methanogen (sym.M2)

The four agents in this example were chosen in a random way with maximum distribution in the
comprehensive tree.

The results of this example are very promising. All five bacteria, namely one control and four test
organisms, are placed in the correct phylogenetic neighborhoods. The correctness of the placements is
confirmed by the positions of those five organisms in the comprehensive tree.
The control, E. coli at leaf node 7270 under branch node 7224 in the comprehensive tree, is unambiguously
placed under branch node 7259 with E. coli (itself), E.colil, and E.colirnGS as three leaf nodes when
probes at Qs 1.0 are used. The best example of the four cases is probably Ort.tsuts6, which resides at leaf
node 5404 under branch node 5383 in the comprehensive tree. This prokaryote was uniquely placed under
branch node 5391 with Ort.tsuts9 at the only direct leaf node 5411 of this branch node. Another particularly
noteworthy and interesting case is the identification of sym.M2. The sequence of the 16S rRNA from this
organism has only 359 nucleotides with one undetermined position. The correct placement of this
prokaryote in the representative tree was possible because some signature sequences in its poorly
sequenced 16S rRNA apparently remained intact and identifiable.

Although the prokaryotic organisms could be placed in correct clusters, there were positive errors, i.e. some
groups which are not in the correct phylogenetic neighborhoods were positively identified. This kind of
error occurs because many of the signature sequences used have a value of Qs of less than 1. The number
of these false positive errors decreased as the probe quality Qs increased from 0.6 to 1.0, but for a specific
organism and a specific probe quality level there was no dramatic difference in the error rate between using
nonameric and undecameric probes. Despite this imperfection, one point should be stressed: even though
the false positives occur, the correct phylogenetic neighborhoods are among the groups identified in all
cases. Moreover, the correct neighborhood is readily identified by the presence of multiple hits whereas the
noise placements are frequently loners, a feature of the method which stems directly
from the parent/child relationship between nodes in a bifurcating tree. Thus, false positives are not a
serious impediment to success. False negatives are also not a problem because of the redundancy of
signature sequences that occur at many nodes.
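The contrast between multiple hits and loner placements can be illustrated with a short Python sketch. The rule used here is an assumption made for illustration, not the specification's procedure: a hit node counts as supported when another hit node lies on the same root-to-leaf path. The tree, given as a child-to-parent mapping, and the node numbers are hypothetical.

```python
def ancestors(node, parent):
    """Yield every ancestor of node, walking child -> parent links."""
    while node in parent:
        node = parent[node]
        yield node


def split_hits(hit_nodes, parent):
    """Separate hit nodes into supported placements and loners.

    A hit is 'supported' when another hit is its ancestor or descendant;
    otherwise it is treated as a loner (assumed rule for this sketch).
    """
    hit_set = set(hit_nodes)
    supported = set()
    for node in hit_set:
        for anc in ancestors(node, parent):
            if anc in hit_set:
                supported.update((node, anc))
    return sorted(supported), sorted(hit_set - supported)


# Hypothetical bifurcating tree; node 1 is the root.
toy_parent = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4}
supported, loners = split_hits([2, 4, 8, 7], toy_parent)
```

Here nodes 2, 4 and 8 reinforce one another along a single lineage, while node 7 has no related hit and would be read as noise.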

This example shows that when a small set of 16S rRNA sequences is analyzed, at least some signature
sequences exist that are representative of the phylogenetic groups that can be identified by tree
constructions based on the complete 16S rRNA sequences. The consequence of having thousands of such
sequences in the dataset was not known in the prior art. Possibly noise would build up to the extent that
useful signatures would be obscured. Even if such sequences continued to exist in the larger data set, it was
not clear that their numbers would be useful, nor was it clear that they could be readily identified.



The results establish beyond any doubt that characteristic oligonucleotides in the bacterial 16S rRNA
sequence dataset do in fact exist in huge numbers. Over 15,000 nonamers alone were identified, in
many cases with multiple coverage of the various phylogenetic groupings in the 929 organism representative
tree.



It is invaluable to identify these signature sequences because a group of evolutionarily related bacteria can
be distinguished from other groups by a set of characteristic oligonucleotides specific to that group. The
existence of these signatures is a direct demonstration of an innate characteristic of the evolution of
bacterial 16S rRNAs that can be utilized to identify an unknown prokaryotic agent by elucidating its
immediate phylogenetic neighborhood. The characteristic oligonucleotides can be used as the basis for
developing hybridization probes that can be used in order to design valuable oligonucleotide microarrays.
Herein the utility of the signature sequences was tested by in silico hybridizations using as unknowns
sequences that had not been included in the original representative tree. These studies demonstrated that the
characteristic oligonucleotides in the unknown organisms readily provided their correct placement in the
tree.

This example by no means limits the invention to characteristic oligonucleotides in the 16S rRNA sequence
dataset. On the contrary, it encompasses many variations and specific improvements including, but not
limited to, the following:

1. Use of new data available at RDP (both the newly released 16S rRNA sequences of release 8.1 and an
updated prokaryotic phylogenetic tree).

2. Improvements to the representative tree, e.g. to provide that every cluster of prokaryotes in the
comprehensive tree is represented by at least one bacterium in this tree. Merging of pairs of
closely related but not full length sequences to obtain a full length representation of that tree region
may be possible. It also may be useful to better weight the number of entries from various clusters.

3. Use of different but sensible functions to calculate the signature quality index. Since the quality index is
the most important tool for evaluating the signature potential of oligonucleotides in this system, changing
the function can have a substantial impact on the specific result.

4. Assembling and use of a comprehensive set of characteristic oligonucleotides, by which the majority of
the groups and all of the important groups in the representative tree can be identified. The oligonucleotides
in this set are likely to have various lengths.

5. Applying mathematical and programming techniques to facilitate the final interpretation of hybridization
results.



Example 3 - Soil Samples

16S rRNA is purified from an unknown organism isolated from soil and amplified by RT-PCR using
primers directed to conserved regions and flanking a variable region of the molecule. The PCR products
are subjected to digestion by a restriction endonuclease, fluorescently labeled with Cy5, and then hybridized
to an array of all possible 8-mer peptide nucleic acids. After washing, the pattern of hybridization is
observed by confocal laser fluorescence scanning, and interpreted in terms of the known signature
sequences for bacteria and the organism is assigned to the genus Nocardia.

Example 4 - Soil Samples

16S rRNA is purified from an unknown organism isolated from soil and amplified by RT-PCR using
primers directed to conserved regions and flanking a variable region of the molecule. The PCR products
are subjected to digestion by a restriction endonuclease, fluorescently labeled with Cy5, and then hybridized
to an array of 5,000 DNA probes designed to recognize the 16S rRNA sequences of particular species.
After washing, the pattern of hybridization is observed by confocal laser fluorescence scanning, and no
significant hybridization is found. The same labeled nucleic acids are then hybridized to an array of 4,000
probes to bacterial signature sequences identified by the methods of this invention. After washing, the
pattern of hybridization is observed by confocal laser fluorescence scanning, and interpreted in terms of the
known signature sequences for bacteria and the organism is assigned to the genus Bacillus.



Example 5 - Air sample

Nucleic acids isolated from an air filtrate are aliquoted into 50 wells of a fluorescence microtiter plate, each
well containing a 5'-FITC, 3'-quencher molecular beacon hairpin probe specific for a selected signature
sequence. After heating to 95 °C for 5 minutes, the plate is allowed to cool slowly to room temperature, and
fluorescence is read. The pattern of fluorescence is compatible with the presence of a strain of
Staphylococcus that is closely related to a known pathogenic strain.

Example 6 - Mutated Protease 

Nucleic acids of a virus are isolated and amplified from a blood sample and signature sequences are scored
using the Qiagen Genomics Masscode sequence detection technology. The presence of particular signature
sequences permits identification of a strain bearing a mutation of a previously-known protease, which
confers on it resistance to particular therapeutic drugs.

Example 7 - Meat sample

Nucleic acids are isolated from a meat sample claimed to be goose liver and signature sequences are scored
using the Third Wave Technologies Invader directed-cleavage assay. The presence of a particular signature
sequence indicates the presence of turkey meat as an adulterant.

Example 8 - Blood sample

Blood taken from the bed of a pickup truck owned by a suspected poacher is analyzed for signature
sequences of mammalian mitochondrial DNA using individual hybridization assays detected by
chemiluminescence produced by an alkaline-phosphatase-conjugated RNA/DNA-specific antibody. The
results suggest the blood comes from an animal of the genus Euarctos, and the suspect is arrested on
suspicion of poaching an American black bear.

Example 9 - Air sample

Nucleic acids isolated from an air filtrate are aliquoted into 50 wells of a fluorescence microtiter plate, each
well containing a 5'-FITC, 3'-quencher molecular beacon hairpin probe specific for a selected 18S rRNA
signature sequence. After heating to 95 °C for 5 minutes, the plate is allowed to cool slowly to room
temperature, and fluorescence is read. The pattern of fluorescence is compatible with the presence of both
a mold belonging to the genus Stachybotrys and a fungus belonging to the genus Aspergillus. Two DNA
oligonucleotides (one 5' biotinylated) corresponding to two signature sequences found in the sample are
used in a PCR reaction to amplify a segment (of predicted length 46 nucleotides, based on the positions of
the signature sequences in the 16S rRNA sequence) of rDNA. The biotinylated product is immobilized in
single-stranded form and used as a probe for high-affinity, high-specificity detection of a novel species of
Stachybotrys.

Example 10

Nucleic acids of a virus are isolated and amplified, and signature nucleic acid
sequences are scored using the Qiagen Genomics Masscode sequence detection technology. Eight
signature enzyme activities are also assayed for, and two are found, and 24 proteins whose presence can
serve as signatures are assayed for by ELISA, and two are detected. The combined presence of particular
signature sequences, activities, and proteins permits identification of a particular viral strain.

Example 11 - Bioterrorism

Air filtrate from a government building is collected and nucleic acids isolated. RNA is enriched using
DNAse and RNA fragmented by heating. Probes specific to several known bioterrorism agents give
negative results. Molecular beacon-based scoring of signature sequences reveals the presence of
unexpectedly high concentrations of bacteria with genetic affinity to the genus Bacillus. Further
investigation reveals an engineered variant strain of B. anthracis, and the building is evacuated. It is noted
that the prior art known to Applicants would fail to identify this engineered strain.

MODIFICATIONS 

Specific compositions, methods, or embodiments discussed are intended to be only illustrative of the
invention disclosed by this specification. Variations on these compositions, methods, or embodiments are
readily apparent to a person of skill in the art based upon the teachings of this specification and are
therefore intended to be included as part of the inventions disclosed herein. Particularly preferred species
and ranges of parameters are partially summarized by Table G.

The nucleic acid sequences included in the database can be any ribosomal RNA, or a fragment thereof, or
DNA encoding ribosomal RNA or a fragment thereof, or the DNA spacer region between rRNA genes; or
either the genomic DNA or RNA of viruses, or artificial RNAs, or any functional RNA molecule such as
RNAse P RNA that is found in a useful variety of organisms. The molecule actually detected may be one
that has a sequence related to the molecule represented in the database, for example PCR, NASBA or
RT-PCR products derived from rRNA or rDNA.

Once identified, signature sequences will preferably be used in the design of hybridization probes. In this
regard, the unique sequences of various lengths are perfect signatures for the specific organism that
they are found in and therefore are obvious candidates for use in the design of specific hybridization probes
for that organism. If a node is associated with multiple signature sequences, as many are in the case of 16S
rRNA, it will be preferable to utilize the one or more with the most favorable hybridization properties.
Depending on the experimental setting, the actual probe can preferably incorporate a portion or all of either
a particular signature sequence or its complement. There are also obvious mathematical relationships
between the signature sequences of different lengths. Thus, for example, a 16 base signature sequence that
is perfect for node N will necessarily show up in the 8-mer signature set as 9 different unique signature
sequences for node N (i.e. representing positions 1-8, 2-9, 3-10, 4-11, 5-12, 6-13, 7-14, 8-15, 9-16 in the
16-mer). Therefore, one will be able to combine signature sequences in some cases to serve as a starting point
in the design of longer probes. Many signature sequences that do not share the type of relationship
described above may still be sufficiently near each other in the primary sequence that it will be possible to
combine them to design a longer probe. This can be accomplished, for example, by including a "wildcard"
hybridization base such as inosine at certain positions. More generally, a variety of non-standard bases can
be used to modify the hybridization properties of a probe based on a signature sequence. Also, the
properties of a signature sequence can be modified to adapt it for use with organisms represented by
another node. Individual monomers in probes or other sequences derived from signature sequences can be
modified to facilitate hybridization or detection. This includes but is not restricted to incorporation of
fluorophores, chemically-labile moieties, isotopes, or halogen atoms. Modifications can be incorporated in
the course of replication by DNA polymerase or RNA polymerase. Labels can be incorporated in the course
of PCR, RT-PCR or NASBA.
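The relationship between signature lengths noted above (a 16-mer contains 16 - 8 + 1 = 9 overlapping 8-mers) can be checked with a short Python sketch. The function name and the 16-base sequence are hypothetical and serve only as an illustration.

```python
def sub_signatures(signature, k):
    """Return every length-k window of a signature with 1-based positions."""
    n = len(signature)
    return [(i + 1, i + k, signature[i:i + k]) for i in range(n - k + 1)]


# Hypothetical 16-base signature, for illustration only.
sixteen_mer = "ACGUACGGUCAUGCAA"
windows = sub_signatures(sixteen_mer, 8)
for start, end, oligo in windows:
    print(f"{start}-{end}: {oligo}")
```

Run in reverse, the same bookkeeping suggests how overlapping shorter signatures for one node could be merged into a candidate longer probe.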

Detection can employ a variety of known methods, both those based on sequence-specific hybridization
and otherwise. Hybridization can be to RNA or DNA, but also to peptide nucleic acids, locked nucleic
acids, branched nucleic acids, cyclic probes, backbone-modified nucleic acids, and base-modified nucleic
acids. Array formats (on single or multiple, e.g., bead supports) will often be valuable. Hybridization can
lead to the capture of a labeled nucleic acid on a solid support such as a bead, membrane, or array. Labels
can be isotopes, chemically-detectable tags, liquid crystals, cleavable chemical tags, fluors, quantum dots,
or enzymes such as alkaline phosphatase, ribozymes, or peroxidase. Enzymes can produce heat, color,
fluorescence, chemiluminescence, precipitates, bioluminescence, changes in liquid crystalline order, or
changes in nucleic acid structure. Hybridization can also lead to production of signals by self-quenching
probes such as molecular beacons, or by ribozyme activation, FRET pairs, or changes in plasmon
resonance or similar interfacial optical phenomena, in mechanical resonant frequency, in redox activity or
electrical conductivity, in electrophoretic or chromatographic mobility, in affinity for chelated metals,
minerals, or antibodies or proteins, or in particle or molecular mobility. Robotic methods of preparation and
microtiter plates can be employed with the invention to further automate multiple assays.

The method of the invention is especially useful when the hybridization probes consist of every possible
sequence of one length. For example, there are 65,536 unique octamers. The signature
characteristics of every one of these octamers are obtained by the method of the invention for any nucleic
acid of interest. When the nucleic acid being used is 16S rRNA or 16S rDNA the same array can be used
for any bacterial identification. If multiple organisms are present this will be apparent as there will be
conflicting signatures. Only the sample preparation procedure would differ. The same array can also be
used with any other nucleic acid. Hence by changing the nucleic acid to the positive strand genomic RNA
of the flavivirus family, the experimental results would be useful in identifying the closest known genetic
relatives of the test virus in this virus group. It is an important aspect of the invention that it is not necessary
that all the oligomers in the array work properly. There is frequently a high redundancy of signature
sequences associated with a particular node so that if several fail the node will still give a signal if it is
represented in the sample.
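The figure of 65,536 octamers follows from four bases at each of eight positions (4^8). A minimal Python sketch that enumerates such a complete probe set is shown below. The function name and the RNA alphabet are choices made for this illustration.

```python
from itertools import product


def all_oligos(k, alphabet="ACGU"):
    """Lazily generate every possible oligonucleotide of length k."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)


octamer_count = sum(1 for _ in all_oligos(8))  # equals 4 ** 8
```

Because the generator is lazy, the same function could be used for longer probes without holding the whole set in memory.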






Although signature sequences will preferably be used in conjunction with hybridization methods of
various types, it should be noted that these sequences also have unique physical properties. Therefore, if a
plurality of signature sequences are generated by experimental means, e.g. by digestion with ribonuclease
T1 or a restriction endonuclease, these physical properties can be measured. Mass spectrometry, which can
comprise matrix-assisted laser desorption ionization (MALDI) or electrospray or TOF or resonance
methods, can be used to determine mass within 10%, more preferably 2% and most preferably 1% for each
sequence. Likewise, applications exist where signature sequences can be used in the design of PCR primers
to amplify larger regions of DNA or RNA. For example, a completely unknown organism is detected by the
method of the invention and best assigned to a large early branching group. The probes that detected this
affiliation could then be used as amplification primers to readily obtain a large region for full sequencing or
as a longer probe.



Although the invention is preferred for use with functional nucleic acids it can also be used with DNA
sequences such as genes that encode protein. In this case, a database of genes for the equivalent protein
from a sufficient number and variety of organisms or viruses would be needed. The tree used might be
deduced from the genes themselves but in order to avoid possible complications of lateral gene transfer it is
preferable to use a tree based on 16S rRNA sequence data.



When the invention is used with viruses, it is necessary to appreciate that all viruses do not share a single
common ancestor. There are many distinct groups of viruses, some with DNA genomes and some with RNA
genomes, e.g. the Flaviviridae, which is a large family of single stranded positive sense RNA viruses that
includes the causative agents of yellow fever, St. Louis encephalitis, Japanese encephalitis, hepatitis C and
Dengue fever. The genome is typically in the size range 9,500-12,500 nucleotides. Several common genes
exist and hence meaningful phylogenetic trees can be developed which span the entire group. Thus, it is
possible to generate signature sequences that are specific for Dengue serotype II or Dengue in general,
etc. The methods of the invention can be used for any virus group as long as a meaningful tree can be
produced. However, the sample preparation may require more steps. The different types of nucleic acid
involved (single strand positive sense RNA, double stranded DNA etc.) may limit the number of virus
groups that can be detected in one experiment.

Features preferred with the invention in certain cases comprise: the nucleic acid is DNA that encodes
ribosomal RNA or a fragment or a complementary sequence of the foregoing; the nucleic acid is RNA
complementary to one of the strands of the DNA that is in the spacer region between ribosomal RNA genes
or a fragment of the foregoing; the nucleic acid is DNA isolated from the spacer region between ribosomal
RNA genes or a fragment of the foregoing; the nucleic acid is any non mRNA produced by the cell or a
fragment of the foregoing; the nucleic acid is any mRNA produced by the cell or a fragment of the
foregoing; the nucleic acid is genomic DNA or a fragment of the foregoing; the signature quality index Qs
includes terms that weight against false positives and false negatives; the tree contains some multiple
branchings but is substantially bifurcating; the genetic affinity of bacteria or eukaryotic organisms is
determined; the genetic affinity of more than one bacterial or eukaryotic organism can be determined in a
single experiment; wherein the nucleic acid is DNA that encodes ribosomal RNA or a fragment or a
complementary sequence of the foregoing; the nucleic acid is RNA complementary to one of the strands of
the DNA that is in the spacer region between ribosomal RNA genes or a fragment of the foregoing; the
nucleic acid is DNA isolated from the spacer region between ribosomal RNA genes or a fragment of the
foregoing; where the nucleic acid is any [...] of the foregoing.



Other preferred features comprise: the nucleic acid is any mRNA produced by the cell or a fragment of the
foregoing; the nucleic acid is genomic DNA or a fragment of the foregoing; the genetic affinity of more than
one virus can be determined in a single experiment; the nucleic acid is a ribosomal RNA or a fragment or
a complementary sequence of the foregoing; the nucleic acid is DNA that encodes ribosomal RNA or a
fragment or a complementary sequence of the foregoing; the nucleic acid is RNA complementary to one of
the strands of the DNA that is in the spacer region between ribosomal RNA genes or a fragment of the
foregoing; the nucleic acid is any non mRNA produced by the cell or a fragment of the foregoing; the
nucleic acid is any mRNA produced by the cell or a fragment of the foregoing; the nucleic acid is genomic
DNA or a fragment of the foregoing; the signature probes are not all of the same length; the signature
probes represent signature genes; the tree of relationships that can be reasonably expected to signify
genetic relationship was previously published or otherwise generated by a third party; the hybridization
probes are complementary to or the same sense as the signature sequences; a plurality of signature
sequences is combined into one or more larger hybridization probes; a hybridization probe incorporates a
portion of the information in a signature sequence; the signature probes are comprised of a nucleic acid
analog comprising PNA, 2'-O-methyl DNA or analog thereof; the presence or absence of a signature
sequence in a test sample is determined by physical characterization; the signature sequences are identified
by the method of claim 1; physical characterization is done with mass spectrometry; the nucleic acid
molecule is a DNA molecule; the DNA molecule is a cDNA molecule.

The invention may also be applicable in unexpected situations. For example, there are currently a large
number of genomes being completely sequenced. When one assembles phylogenetically meaningful
clusters of whole genome sequences there are certain genes that are highly characteristic of particular
clusters of organisms. These signature genes can be used in the invention to identify unknown organisms,
preferably by detecting the presence of activities or gene products associated with the signature genes
rather than by a nucleic acid assay.

What is claimed is: 
Figure 5

A phylogenetic tree and its corresponding Newick format presentation.
Figure 6

Schematic illustration of how a tree file (shown in part a of the figure) in Newick format is parsed in a
stepwise and bottom up fashion (part b of the figure).
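A bottom up reading of a Newick string can be sketched in Python as follows. This is an illustrative parser written under stated assumptions, not the parser of the specification: it uses a stack so that the children of a group are completed before the group itself, it accepts single-quoted labels and optional branch lengths, and the node layout (a dict with name, length and children) is hypothetical.

```python
import re

# Tokens: a quoted label, one structural character, or an unquoted word/number.
TOKEN = re.compile(r"\s*('(?:[^']|'')*'|[(),:;]|[^\s(),:;']+)")


def parse_newick(text):
    """Parse a Newick string into nested dicts, finishing children first."""
    tokens = TOKEN.findall(text)
    stack = [[]]   # each entry collects the children of one open '('
    node = None    # most recently completed node
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            stack.append([])
        elif tok == ")":
            children = stack.pop()
            node = {"name": "", "length": None, "children": children}
            stack[-1].append(node)
        elif tok == ":":
            i += 1
            node["length"] = float(tokens[i])
        elif tok in ",;":
            pass
        else:
            name = tok[1:-1].replace("''", "'") if tok.startswith("'") else tok
            if node is not None and tokens[i - 1] == ")":
                node["name"] = name  # label attached to an internal node
            else:
                node = {"name": name, "length": None, "children": []}
                stack[-1].append(node)
        i += 1
    return stack[0][0]


toy_tree = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
```

Quoted labels are matched as single tokens, so parentheses inside a label such as a "(T)" type-strain marker do not disturb the nesting.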
Figure 7

Schematic illustration of the trimming
Figure 8

The composite hash of the oligonucleotides that is built from the 16S rRNA sequences.
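One way to picture such a hash is a table keyed by oligonucleotide whose values record which sequences contain it. The Python sketch below is only an assumption about the layout, since the caption does not describe the actual structure. The function name and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_oligo_hash(sequences, k):
    """Map every length-k oligonucleotide to the set of sequences containing it.

    sequences is {sequence_id: sequence_string}.
    """
    table = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(seq_id)
    return table


# Toy sequences standing in for 16S rRNAs (hypothetical).
toy_seqs = {"org1": "ACGUACGU", "org2": "CGUACGGA", "org3": "UUUUACGU"}
toy_hash = build_oligo_hash(toy_seqs, 4)
unique_to_org3 = sorted(o for o, ids in toy_hash.items() if ids == {"org3"})
```

Oligonucleotides whose set of owners coincides with the leaves under a single tree node would be the candidates for signatures of that node.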
Figure 9

The number of oligonucleotides and the length are related. (a) The number of all possible oligonucleotides
increases exponentially with the length. The curve is described by the function f(x) = 4^x. (b) The numbers of
the total and multi-occurring oligonucleotides in the 16S rRNA sequence dataset also increase with the
length. The increases are slower than that in (a), probably due to sequence constraints imposed by 16S rRNA
structure and function.
Figure 10 The representative prokaryotic phylogenetic tree
in Newick format.

(((((((((('<Msr.b[...] str. 227 DSM 1538' : 0.13236 , '<Msp.hungat>
Methanospirillum hungatei str. JF1 DSM 864 (T)' : 0.16948 ) : 0.24421 , '<Hf.volcani> Haloferax volcanii
str. DS-2 ATCC 29605 (T)' : 0.03648 ) : 0.09112 , ('<env.SBAR16> Santa Barbara Channel
bacterioplankton DNA clone SBAR16' : 0.19448 , '<Tpl.acidop> Thermoplasma acidophilum str. 122-1B2'
: 0.22004 ) : 0.04224 ) : 0.10775 , '<Arg.fulgid> Archaeoglobus fulgidus str. VC-16 DSM 4304 (T)' :
0.04075 ) : 0.05544 , ('<Mb.formici> Methanobacterium formicicum DSM 1312' : 0.03067 , '<Mt.fervidl>
Methanothermus fervidus' : 0.19624 ) : 0.01978 ) : 0.0947 , '<Tc.celer> Thermococcus celer str. VU 13
DSM 2476 (T)' : 0.00981 ) : 0.05532 , ('<Mc.vanniel> Methanococcus vannielii str. EY33' : 0.02484 ,
'<Mc.jannasc> Methanococcus jannaschii str. JAL-1 DSM 2661 (T)' : 0.1614 ) : 0.00857 ) : 0.02807 ,
'<Mpy.kandll> Methanopyrus kandleri str. av19 DSM 6324 (T)' : 0.09845 ) : 0.02703 , ('<env.pJP27> Mud
Volcano area of Yellowstone NP ("Black Pool") hot spring DNA clone pJP27' : 0.06783 ,
(('<env.SBAR12> Santa Barbara Channel bacterioplankton DNA clone SBAR12' : 0.1046 , '<env.pJP89>
Mud Volcano area of Yellowstone NP ("Black Pool") hot spring DNA clone pJP89' : 0.28523 ) : 0.01132 ,
('<Tmf.penden> Thermofilum pendens str. Hw3 DSM 2475 (T)' : 0.04404 , ('<Sul.acalda> Sulfolobus
acidocaldarius str. 98-3 ATCC 33909 (T)' : 0.04024 , '<Thp.tenax> Thermoproteus tenax' : 0.15875 ) :
0.02106 ) : 0.09273 ) : 0.20883 ) : 0.03789 ) : 0.31178 , ('<Aqu.pyroph> Aquifex pyrophilus str. Kol5a' :
0.20649 , (('<Tt.maritim> Thermotoga maritima str. MSB8 DSM 3109 (T)' : 0.01001 , '<Fer.island>
Fervidobacterium islandicum str. H-21 DSM 5733 (T)' : 0.16351 ) : 0.23062 , ((('<Mei.ruber4>
Meiothermus ruber str. Loginova 21 ATCC 35948 (T)' : 0.14908 , '<D.radiodur> Deinococcus radiodurans
ATCC 35073' : 0.19907 ) : 0.08298 , ('<Crx.aurant> Chloroflexus aurantiacus str. J-10-fl ATCC 29366 (T)'
: 0.1976 , '<Tmc.roseum> Thermomicrobium roseum ATCC 27502 (T)' : 0.36297 ) : 0.11213 ) : 0.01165 ,
((((((((((((('<Acp.laidla> Acholeplasma laidlawii str. JA1' : 0.11002 , '<C.ramosum> Clostridium ramosum
str. 113-1 ATCC 25582 (T)' : 0.30774 ) : 0.00736 , '<M.capricol> Mycoplasma capricolum ATCC 27343
(T) [gene=rrnB]' : 0.38452 ) : 0.10528 , '<Stc.therm3> Streptococcus thermophilus DSM 20617 (T)' :
0.05073 ) : 0.15065 , '<Bcafaecal> Enterococcus faecalis' : 0.0306 ) : 0.01738 , ('<L.casei> Lactobacillus
casei subsp. casei ATCC 393 (T)' : 0.13937 , '<L.delbruck> Lactobacillus delbrueckii subsp. delbrueckii
str. Calvert ATCC 9649 (T)' : 0.04809 ) : 0.01852 ) : 0.02217 , '<Lis.monoc3> Listeria monocytogenes' :
0.02418 ) : 0.0404 , '<B.cereus4> Bacillus cereus IAM 12605 (T)' : 0.06989 ) : 0.0034 , ('<B.subtilis>
Bacillus subtilis str. 168' : 0.05051 , '<B.stearoth> Bacillus stearothermophilus NCDO 1768 (T)' : 0.05959 )
: 0.0075 ) : 0.12658 , '<Eub.barker> Eubacterium barkeri ATCC 25849 (T)' : 0.28781 ) : 0.0097 ,
('<C.quercico> Clostridium quercicolum ATCC 25974 (T)' : 0.13519 , '<Hel.chlor2> Heliobacterium
chlorum ATCC 35205 (T)' : 0.1075 ) : 0.01024 ) : 0.01183 , ('<Fus.nuclea> Fusobacterium nucleatum
subsp. nucleatum ATCC 25586 (T)' : 0.08593 , ('<Stm.ambofa> Streptomyces ambofaciens' : 0.06051 ,
('<Cor.xerosi> Corynebacterium xerosis ATCC 373 (T)' : 0.10315 , ('<Bif.bifidu> Bifidobacterium bifidum
ATCC 29521 (T)' : 0.29842 , '<Arb.globif> Arthrobacter globiformis str. 168 DSM 20124 (T)' : 0.12957 ) :
0.06797 ) : 0.00748 ) : 0.3137 ) : 0.01738 ) : 0.00511 , ('<C.leptum> Clostridium leptum ATCC 29065 (T)'
: 0.16126 , ('<C.butyric4> Clostridium butyricum str. E.VI.3.6.1 NCIMB 8082' : 0.06037 , '<C.pasteuri>
Clostridium pasteurianum ATCC 6013 (T)' : 0.07626 ) : 0.38023 ) : 0.02432 ) : 0.01262 ,
(((((((((('<Rub.gelat2> Rubrivivax gelatinosus str. ATH 2.2.1 ATCC 17011 (T)' : 0.07169 , '<Spr.voluta>
Spirillum volutans ATCC 19554 (T)' : 0.06664 ) : 0.00462 , '<Rcy.purpur> Rhodocyclus purpureus str.
6770 DSM 168 (T)' : 0.04015 ) : 0.02165 , '<Nis.gonorl> Neisseria gonorrhoeae str. B 5025 NCTC 8375
(T)' : 0.19789 ) : 0.01431 , '<Ste.maltop> Stenotrophomonas maltophilia ATCC 13637 (T)' : 0.24098 ) :
0.02299 , ('<E.coli> Escherichia coli [gene=rrnB operon]' : 0.05825 , '<Ps.aerugi3> Pseudomonas
aeruginosa DSM 50071 (T)' : G.63646 ) : 0.03524 ) : 0.04488 , '<Alm.vinosm> Allochromatium vinosum
ATCC 17899 (T)' : 0.0233 ) : 0.04869 , '<Hrh.halcM> Halorhodospira halochloris str. A ATCC 35916 (T)'
: 0.05948 ) : 0.08019 , (('<Rruhami3> Rhodospirillum rubrum str. ATH 1.1.1; S1 ATCC 11170 (T)' :
0.04904 , '<Azs.brasi2> Azospirillum brasilense str. Sp 7 NCIMB 11860 (T)' : 0.3086 ) : 0.01343 ,
(('<Ric.prowaz> Rickettsia prowazekii str. Breinl ATCC VR-142 (T) (alpha purple bacterium)' : 0.1406 ,
'<Spg.capsul> Sphingomonas capsulata ATCC 14666 (T)' : 0.13872 ) : 0.02068 , ('<^Wegum8>
Rhizobium leguminosarum IAM 12609 (T)' : 0.015J6 , ('<Bdr.japoni> Bradyrhizobium japonicum LMG
6138 (T)' : 0.05736 , '<Rm.vanniel> Rhodomicrobium vannielii str. EY33 ATCC 51194' : 0.093 ) : 0.04263
) : 0.00617 ) : 0.0346& ) : 0.06772 ) : 0.00546 , (('<Myx.xanthu> Myxococcus xanthus str. DK1622' :
0.11263 , '<Dsb.postga> Desulfobacter postgatei str. 2 ac 9 DSM 2034 (T)' : 0.19098 ) : 0.01154 ,
('<Dsv.desulf> Desulfovibrio [...]' : 0.01563 , ('<Bde.stolpi>

Bdellovibrio stolpii str. UKi2 ATCC 27052 (T)' : 0.05967 , ('<Cam.jejun5> Campylobacter jejuni subsp.
jejuni str. TGH 9011 ATCC 41431' : 0.01753 , ('<Wlrmicci2> Wolinella succinogenes str. 602W (FDC)
ATCC 29543 (T)' : 0.05551 , '<Hlb.pylor6> Helicobacter pylori ATCC 43504 (T)' : 0.02351 ) : 0.18884 ) :
1.11671 ) : 0.18947 ) : <LG 1602) : 0.15633 ) : 0.01513 , ((((((('<Trp.pallid> Treponema pallidum str.
Nichols' : 0.14543 , '<Spi.stenos> Spirochaeta stenostrepta str. Z1 ATCC 25083 (T)' : 0.03623 ) : 0.03698 ,
'<Bor.burgdo> Borrelia burgdorferi str. B31 ATCC 35210 (T)' : 03604 ) : 0.0859 , '<Spi.haloph>
Spirochaeta halophila str. RS1 ATCC 29478 (T)' : 0.02473 ) : 0.01206 , '<Brs.hyodys> Brachyspira
hyodysenteriae str. B204 ATCC 31212' : 0.43546 ) : 0.04129 , ('<Lpn.illini> Leptonema illini str. 3055' :
0.07041 , '<Lps.interr> Leptospira interrogans str. Kennewicki, serovar pomona' : 0.16902 ) : 0.05013 ) :
0.01817 , ('<Fib.sucS85> Fibrobacter succinogenes subsp. succinogenes str. S85 ATCC 19169 (T)' :
0.23142 , '<Acbt.capsi> Acidobacterium capsulatum str. 161' : 0.21099 ) : 0.03073 ) : 0.0094 ,
((((('<Syn.6301> Synechococcus sp. PCC 6301' : 0.12285 , '<NosUuusci> Nostoc muscorum PCC 7120' :
0.06977 ) : 0.01225 , ('<Zea_mays_C> Zea mays (maize; corn; Indian corn) -- chloroplast' : 0.145 ,
'<Olst.lut_C> Olisthodiscus luteus (stramenopile) -- chloroplast' : 0J525 ) : 0.0^491 ) : 0.012 ,
'<Glb.violac> Gloeobacter violaceus PCC 7421' : 0.07279 ) : 0.01171 , ('<env.MC18> Mount Coot-tha
region (Brisbane, Australia) 5-10cm depth soil DNA clone MC18' : 0.01409 , '<Qhdpsitta>
Chlamydophila psittaci str. 6BC ATCC VR-125 (T)' : 0.36004 , '<Pir.staley> Pirellula staleyi ATCC 27377'
: 0.34247 ) : 0.25993 ) : 0.1121 ) : 0.03255 , ('<Chl.limico> Chlorobium limicola str. 8327' : 0.1335 ,
('<Tnm.lapsum> Thermonema lapsum ATCC 43542 (T)' : 0.0332 , ('<Flx.litora> Flexibacter litoralis str.
Lewin SIO-4 ATCC 23117 (T)' : 0.01576 , ('<Cy.hutchin> Cytophaga hutchinsonii str. D465 (P.H.A.
Sneath) ATCC 33406 (T)' : 0.0073 , ('<Prb.dfflEhi> Persicobacter diffluens str. Lewin UM-1 ATCC 23140'
: 0.00585 , ('<Sap.grandi> Saprospira grandis ATCC 23119 (T)' : 0.02768 , ('<Flx.canada> Flexibacter
canadensis ATCC 29591 (T)' : 0.03254 , (('<Bac.fragil> Bacteroides fragilis ATCC 25285 (T)' : 0.04826 ,
'<Prv.rumcol> Prevotella ruminicola subsp. ruminicola ATCC 19189 (T)' : 0.20539 ) : 0.02821 ,
('<Cy.lytica> Cytophaga lytica str. LIM-21 ATCC 23178 (T)' : 0.14365 , '<Emb.brevi2> Empedobacter
brevis ATCC 14234' : 0.0913 ) : 0.35994 ) : 0.12199 ) : 0.33291 ) : 0.47588 ) : 0.14622 ) : 0.18424 ) :
0.08878 ) : 0.30465 ) : 0.05104 ) : 0.00825 ) : 0.02261 ) : 0.00329 ) : 0.56238 ) : 0.52312 ) : 0.05444 ) : 
0.31178 ); 
Figure 11. The graphic view of the representative prokaryotic phylogenetic tree.



010AUS of USPTO Customer No. 26830 






[Figure 12: tree diagram. Leaves, top to bottom: Acidocella facilis; Acidiphilium angustum; Acidiphilium acidophilum; Acidiphilium multivorum; Acidiphilium organovorum; Gluconacetobacter diazotrophicus; Gluconacetobacter xylinus; Acidomonas methanolica; Acetobacter pasteurianus; Acetobacter aceti; Gluconobacter cerinus; Gluconobacter frateurii. Labeled branch nodes: 5543, 5545, 5547, 5549, 5556, 5557, 5565, 5573, 5575, 5576, 5577.]



Figure 12 



A local region of the representative tree following trimming from 3[illegible] to 12 sequences. The branch numbers
in the representative tree are labeled in the picture and can be correlated with the results given in Table F.
The complete representative tree is on the CD that is attached to this application.



Table A.
Five best Qs scores for 15-mers that occur at least twice in the 16S rRNA data set. Files containing
complete tables of this type are given for various sized test sequences on the CD that is included with this
application. Sequences that never occur or are specific signatures of an individual organism are not
included in these lists. (Only a representative portion of the sequence listing is shown here.)

Sequence         NodeNum  QualityValue
AAAAAAACAGUCUCA  2815     0.5
AAAAAAACAGUCUCA  2831     0.5
AAAAAAACAGUCUCA  2836     0.44
AAAAAAACAGUCUCA  2839     0.4
AAAAAAACAGUCUCA  2865     0.33
AAAAAAAGACGGUAC  2064     1.0
AAAAAAAGACGGUAC  2072     0.67
AAAAAAAGACGGUAC  2107     0.29
AAAAAAAGACGGUAC  2108     0.10
AAAAAAAGACGGUAC  2137     0.07
AAAAAAAUGACGGUA  3770     0.1
AAAAAAAUGACGGUA  3069     0.1
AAAAAAAUGACGGUA  2027     0.07
AAAAAAAUGACGGUA  2023     0.07
AAAAAAAUGACGGUA  1780     0.07
AAAAAACAGUCUCAG  2815     0.5
AAAAAACAGUCUCAG  2831     0.5
AAAAAACAGUCUCAG  2836     0.44
AAAAAACAGUCUCAG  2839     0.4
AAAAAACAGUCUCAG  2865     0.33

etc.
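As an illustration only, a listing of this kind could be produced by grouping (sequence, node, quality) records by sequence and keeping the five highest quality values for each. The sketch below is in Python; the function name `best_scores` and the record layout are assumptions made for the example and are not taken from the application, and the sixth record is a hypothetical value added to show truncation.

```python
from collections import defaultdict
import heapq


def best_scores(records, k=5):
    """Keep the k highest-quality (node, quality) pairs for each sequence."""
    by_seq = defaultdict(list)
    for seq, node, quality in records:
        by_seq[seq].append((node, quality))
    # nlargest returns the hits sorted from highest to lowest quality
    return {
        seq: heapq.nlargest(k, hits, key=lambda hit: hit[1])
        for seq, hits in by_seq.items()
    }


records = [
    ("AAAAAAAGACGGUAC", 2064, 1.0),
    ("AAAAAAAGACGGUAC", 2072, 0.67),
    ("AAAAAAAGACGGUAC", 2107, 0.29),
    ("AAAAAAAGACGGUAC", 2108, 0.10),
    ("AAAAAAAGACGGUAC", 2137, 0.07),
    ("AAAAAAAGACGGUAC", 9999, 0.01),  # hypothetical sixth hit, dropped by k=5
]
top = best_scores(records)
```

Here `top["AAAAAAAGACGGUAC"]` holds the five rows shown for that 15-mer in Table A, ordered by decreasing quality value.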






Table B

Organism-specific sequences. Each of these sequences is uniquely found in the indicated organism. A file
containing a complete table of this type for sequences of length 12 can be found on the CD that is included
with this application. Similar lists of unique sequences can be generated for any length. (Only a
representative portion of the sequence listing is shown here.)

Sequence      Organism
AAAAAAACCApU  Mmycoide6
AAAAAAACGUGC  QspAZ3_Bl
AAAAAAAGUUUC  Buc.aphUso
AAAAAAAUAAAA  Buc.aphUso
AAAAAAAUqAAG  Nostmuscr
AAAAAAAUUAGG  Mfloccul2
AAAAAAAUUUAU  Buc.aphCvi
AAAAAACACGUC  Eub.cellu2
AAAAAACCAACC  C.argenti3
AAAAAACCAAUC  C.subterm2
AAAAAACCACUC  B. pallidus
AAAAAACCCUUC  Tms.chilns
AAAAAACCGpCC  Nsp.marin2
AAAAAACCGGUC  Trb.tumesz
AAAAAACGUqCC  C. spAZ3_Bl
AAAAAACUAAAG  Buc.aphCvi
AAAAAACUqjGC  env.DA052
AAAAAACUGACG  env.OPB92
AAAAAAGAAGCA  Buc.aphCvi
AAAAAAGAGUGG  Mmlo.WX
AAAAAAGCCCAC  Pps.octavi
AAAAAAGCCGUC  Eub.rumina
AAAAAAGCCUUA  sym.Camnhe
AAAAAAGGOGGA  Buc.aphCvi
AAAAAAGUUqUC  Cow.rumin5
AAAAAAGUUUCG  Buc.aphUso
AAAAAAUAAAAC  Buc.aphUso
AAAAAAUACUCC  str.l6SX-l
AAAAAAUAGAGU  M.caprico6
AAAAAAUAUGUC  Cam.graci2
AAAAAAUCAAAA  Acp.oculi2
AAAAAAUCAAAU  M. conjunct
AAAAAAUCAAUC  Mmlo.WX
AAAAAAUCCAUC  env.Aspo3
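A minimal sketch of how a list of organism-specific sequences of any chosen length could be generated is given below in Python. It is not the program described in this application: the function name `organism_specific_kmers`, the dictionary input, and the two toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def organism_specific_kmers(sequences, k):
    """Map each k-mer found in exactly one organism to that organism.

    `sequences` is a dict of organism name -> RNA sequence string
    (a hypothetical input layout chosen for this example).
    """
    owners = defaultdict(set)
    for organism, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(organism)
    # keep only k-mers whose owner set has a single member
    return {
        kmer: next(iter(orgs))
        for kmer, orgs in owners.items()
        if len(orgs) == 1
    }


# Toy data, not taken from the 16S rRNA data set
toy = {"org_A": "AAAACGUGC", "org_B": "AAAACGUUU"}
unique = organism_specific_kmers(toy, 6)
```

In the toy data the hexamers shared by both organisms are discarded, and each remaining hexamer is reported with the single organism that contains it.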
Table C

The program subsystems and their functions and components.

Subsystem  Function                             Components
I          Sequence file format conversion      readseq, fasta2flat
II         Internal data structure preparation  seq_classifier, tree_parser, select_seq,
                                                probe_hash_table_generator
III        Function value calculation           calc_node_value
IV         Result presentation                  result_printer, result_printer_, groupnode_lister,
                                                listhit_branchnodes, hybridize
Table D

The numbers of oligonucleotides of different lengths.

Oligomer   Length  Ntp        Nt16S    Nm16S
Hexamer    6       4,096      4,096    4,096
Heptamer   7       16,384     16,340   16,324
Octamer    8       65,536     57,023   48,295
Nonamer    9       262,144    125,990  76,376
Decamer    10      1,048,576  186,781  86,856
Undecamer  11      4,194,304  228,995  91,652

Ntp - number of total possible oligonucleotides of length n.

Nt16S - number of total oligonucleotides from selected valid 16S rRNA sequences.

Nm16S - number of multi-occurring oligonucleotides from selected valid 16S rRNA sequences.
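The first count in Table D is simply 4^n, since each of the n positions can hold one of four bases (for example, 4^6 = 4,096 and 4^11 = 4,194,304). The other two counts depend on the sequence set. The Python sketch below shows one way such counts could be tallied; the function name `oligo_counts` and the toy sequences are assumptions, and "multi-occurring" is interpreted here as occurring more than once in the whole data set, which the table footnotes do not spell out.

```python
from collections import Counter


def oligo_counts(sequences, n):
    """Return (total possible, distinct observed, multi-occurring) n-mer counts."""
    occurrences = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            occurrences[seq[i:i + n]] += 1
    n_possible = 4 ** n                     # four bases at each of n positions
    n_observed = len(occurrences)           # distinct n-mers seen at least once
    # assumption: "multi-occurring" means seen more than once overall
    n_multi = sum(1 for count in occurrences.values() if count > 1)
    return n_possible, n_observed, n_multi


# Toy sequences, not taken from the 16S rRNA data set
counts = oligo_counts(["ACGUACGU", "ACGUUUUU"], 4)
```

As the table shows for the real data, the observed counts fall further and further below 4^n as n grows, because the sequence set samples only a small part of the possible oligonucleotides.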
Table E

The number of signature sequences that were found at various quality levels as a function of length.

Signature  Number of signatures at quality level Qs†  Phylogenetic groups
length     = 1.0     > 0.8     > 0.6                  coverage‡
5          35        482       674                    1.99
6          0         371       680                    4.29
7          4         372       1,170                  24.35
8          457       1,722     6,168                  65.39
9          2,533     5,580     15,340                 79.48
10         5,016     9,212     21,919                 82.39
11         6,788     11,607    25,869                 83.15
15         10,487    16,629    39,502                 86.37

† Only signatures that can identify phylogenetic groups with three or more members are counted.
‡ The coverage is calculated by a computer program. Any branch nodes other than those that have two leaf
nodes as their two child nodes in the representative tree are regarded as phylogenetic groups (635 in total).
The signature quality Qs is greater than 0.6.
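The group definition in the second footnote can be sketched as follows. This Python example is illustrative only: it represents a binary tree as nested tuples with leaf names as strings, and it assumes that coverage is the percentage of phylogenetic groups having at least one qualifying signature, which the footnote does not state explicitly. The names `phylogenetic_groups` and `coverage_percent` are hypothetical.

```python
def is_leaf(node):
    """Leaves are represented as strings; internal nodes as 2-tuples."""
    return isinstance(node, str)


def phylogenetic_groups(tree):
    """Yield every branch node except those whose two children are both leaves."""
    if is_leaf(tree):
        return
    left, right = tree
    if not (is_leaf(left) and is_leaf(right)):
        yield tree
    yield from phylogenetic_groups(left)
    yield from phylogenetic_groups(right)


def coverage_percent(groups, groups_with_signature):
    """Assumed definition: percent of groups with at least one signature."""
    covered = sum(1 for group in groups if group in groups_with_signature)
    return 100.0 * covered / len(groups)


# Toy tree with five leaves, not taken from the representative tree
toy_tree = ((("a", "b"), "c"), ("d", "e"))
groups = list(phylogenetic_groups(toy_tree))
coverage = coverage_percent(groups, {(("a", "b"), "c")})
```

In the toy tree the nodes ("a", "b") and ("d", "e") are excluded because both of their children are leaves, leaving two groups.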
Table F. The numbers of nonameric, undecameric, and 15-mer signature sequences at different branch tree
nodes (see Figure 12) in different ranges of signature quality level.

Number of nonameric, undecameric, and 15-mer oligonucleotide sequences in each quality range

      |        1.0          |     [0.8, 1.0)      |     [0.6, 0.8)
Node  |   Σ    9   11   15  |   Σ    9   11   15  |   Σ    9   11   15
5543  |  77   12   26   39  |   0    0    0    0  |  36    5   12   19
5545  | 176   26   58   92  |   0    0    0    0  | 163   25   51   87
5547  |   0    0    0    0  |  27    4   10   13  |  95   10   36   52
5549  |  14    4    5    5  |  13    5   14   14  | 183   24   62   97
5556  |  47    6   13   28  |   0    0    0    0  |  29    2    8   19
5557  |  19    1    8   10  |  61    9   28   24  | 118   27   36   55
5565  | 298   42   99  157  |   0    0    0    0  | 108   24   46   38
5573  | 419   42  136  241  |   0    0    0    0  | 139   32   50   57
5575  |  90   12   30   48  | 165   24   48   93  | 102   23   39   40
5576  |  93   11   28   54  | 134   22   49   63  | 154   27   47   80
5577  |  17    5    6    6  |  61   15   21   25  | 109   26   41   42
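A tabulation of this shape could be built by assigning each signature to one of the three quality ranges and counting per node and per length. The Python sketch below is an assumption-laden illustration: the record layout (node, length, quality), the function names, and the sample records are hypothetical and are not data from this application.

```python
from collections import Counter


def quality_bin(q):
    """Return the quality-range label for q, or None below 0.6."""
    if q == 1.0:
        return "1.0"
    if 0.8 <= q < 1.0:
        return "[0.8, 1.0)"
    if 0.6 <= q < 0.8:
        return "[0.6, 0.8)"
    return None


def tabulate(signatures):
    """Count signatures per (node, quality range, length)."""
    counts = Counter()
    for node, length, q in signatures:
        label = quality_bin(q)
        if label is not None:
            counts[(node, label, length)] += 1
    return counts


# Hypothetical records: (branch node, signature length, quality value)
sample = [
    (5543, 9, 1.0),
    (5543, 9, 1.0),
    (5543, 11, 0.75),
    (5543, 15, 0.8),
    (5543, 15, 0.4),   # below 0.6, not counted
]
table = tabulate(sample)
```

Summing the counts for lengths 9, 11, and 15 within one range gives the value in the corresponding Σ column.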
Table G

Parameter: Input Sample
Preferred: Body fluids (blood, urine, saliva, sputum, sperm, biopsy sample, feces); agricultural products (grains, livestock, vegetables, etc.); soil; air particulates; PCR products; natural waters; contaminated liquids; surface scrapings or swabbings; animal RNA; cell cultures; virus-infected cultures; microbial colonies
More Preferred: Body fluids, agricultural products, microbial colonies, PCR products
Most Preferred: Body fluids, PCR products

Parameter: Target organisms per sample
Preferred: 1-100
More Preferred: 2-20
Most Preferred: 1-2

Parameter: Target sequence type
Preferred: SSU rRNAs, LSU rRNAs, 5S rRNA, spacer region DNA from rRNA gene clusters, 5.8S rRNA, 4.5S rRNA, 10S RNA, RNase P RNA, guide RNA, telomerase RNA, snRNAs (e.g. U1 RNA etc.), scRNAs, mitochondrial DNA, virus DNA, virus RNA, PCR product, human DNA, human cDNA, artificial RNA
More Preferred: 16S rRNA, virus RNA, virus DNA, rRNA gene cluster spacer region DNA
Most Preferred: 16S rRNA

Parameter: Organism
Preferred: Bacterium, virus, plant, animal, fungus, yeast, mold, Archaea; Eukaryotes; spores; [illegible]; human; Gram-negative bacterium, Y. pestis, HIV1, B. anthracis, smallpox virus
More Preferred: Bacterium, Archaea, eukaryotic microorganisms, virus
Most Preferred: Bacterium

Parameter: Nucleic Acid
Preferred: Chromosomal DNA; rRNA; rDNA; cDNA; mtDNA; cpDNA; [illegible]; plasmid DNA; oligonucleotides; PCR product; viral RNA; viral DNA; restriction fragment; YAC, BAC, cosmid
More Preferred: rRNA, viral RNA, viral DNA
Most Preferred: rRNA

Parameter: Sequence length
Preferred: 20-20,000
More Preferred: 100-12,000
Most Preferred: 500-2,500

Parameter: Probe length
Preferred: 5 to 2S0a
More Preferred: 7 to 2fr
Most Preferred: 10 to 20

Parameter: Number of probes
Preferred: 2-100,000,000
More Preferred: 20-100,000
Most Preferred: 50-10,000

Parameter: Classification Level
Preferred: Kingdom; Phylum; Class; Order; Family; Genus; Species; Subgroups; Strain, Tribe, Serotype; Gram stain
More Preferred: Genus; Species, Strain
Most Preferred: Genus, Species

Parameter: Utility
Preferred: Clinical Diagnosis; Biodefense; Research; Adulterant Detection; Counterfeit Detection; Food Safety; Taxonomic Classification; Environmental Monitoring; Agronomy; Law Enforcement
More Preferred: Clinical Diagnosis, Biodefense; Adulterant Detection
Most Preferred: Clinical Diagnosis

Parameter: Sample preparation Agent
Preferred: acid, base, detergent, phenol, ethanol, isopropanol, chaotrope, enzyme, protease, nuclease, polymerase, restriction endonuclease, detergent
More Preferred: Polymerase, restriction endonuclease, phenol
Most Preferred: Polymerase, phenol

Parameter: Sample Preparation Pretreatment
Preferred: Filter, centrifuge, extract, adsorb, protease, nuclease, partition, wash, leach, lyse, electrophoresis, precipitate, germinate, culture
More Preferred: Filter, centrifuge, culture
Most Preferred: Filter, culture

Parameter: Hybridization Media
Preferred: Aqueous buffer, solution containing formamide, zwitterion solution, heated solution; alcohol solution
More Preferred: Aqueous buffer, solution containing formamide, heated solution
Most Preferred: Solution containing formamide, heated solution

Parameter: Cultivation Media
Preferred: LB, M9, blood agar, DMEM, calf; culture medium containing host cells
More Preferred: LB, blood agar, culture medium containing host cells
Most Preferred: Blood agar

Parameter: Separation media for sample preparation
Preferred: Ion exchanger, filter, ultrafilter, depth filter, multiwell filter, centrifuge tube, hydroxyapatite, silica, zirconia, magnetic beads
More Preferred: Ion exchanger, multiwell filter, [illegible], affinity adsorbent, hydroxyapatite, silica, magnetic
Most Preferred: Ion exchanger, silica, magnetic beads

Parameter: Qs Minimum
Preferred: 0.5-1.0
More Preferred: >0.7
Most Preferred: >0.9

Parameter: [illegible]
Preferred: [illegible]
More Preferred: >0.[illegible]
Most Preferred: >0.9

Parameter: [illegible]
Values (columns not recoverable): <0.3; [illegible]

Parameter: Detection Means (Probe Hybridization)
Preferred: Mass spec.; fluorescence; chemiluminescence; enzyme reaction; radiochemical; self-quenching probe hybridization; surface plasmon resonance; total internal reflection fluorescence; liquid crystals; magnetic; infrared; array detection; peptide nucleic acid hybridization; branched DNA hybridization; redox chemistry; LNA hybridization

Parameter: Detection Means (Nonhybridization Methods)
Preferred: Mass spectrometry; electrophoresis; affinity electrophoresis; chromatography, HPLC; neutron activation analysis